Information-Theoretic Properties of

Languages  and  their  Grammars 

Bruce  J.  MacLennan 

Computer  Science  Department 

Naval  Postgraduate  School 
Monterey,  CA  93943 

Abstract: We describe means for computing a number of information-theoretic properties
of languages and their grammars. For example, the entropy of a system of symbols
is widely recognized as a measure of that system's complexity and organization.
We show how the entropy of a language can be computed in a simple way from a grammar
annotated with production probabilities. We then develop means for statistically
estimating these production probabilities from measurable properties of strings in the
language. We also consider the computation of other information-theoretic properties
of languages and grammars, such as the average information borne by a symbol in a
language and the average information used by the productions of a grammar.

1.  Introduction 

The  entropy  of  a  system  is  widely  recognized  as  a  measure  (actually,  a  reciprocal 
measure)  of  that  system's  organization  and  structure  [Shannon,  Brillouin,  Hamming, 
McKay,  Cherry].  This  suggests  that  the  entropy  of  a  language  might  be  an  important 
property  to  measure  to  form  a  basis  for  the  quantitative  comparison  of  languages.  For 
this  reason  we  have  developed  means  for  computing  the  entropies  of  languages. 
Specifically, we derive formulas for computing the entropy of a language from a grammar
for that language that has been annotated with the probabilities of its productions
being applied. We also show techniques whereby these production probabilities can be
inferred from statistical properties of strings in the language. Finally, we apply the
same techniques to several related issues, such as determining the average derivation
length of a grammar, and the average information consumed by a grammar during
string generation. These all seem to show promise as a means for making quantitative
language comparisons.

2. Entropy of a Language

2.1 Definition of Entropy

Suppose $\Sigma$ is a finite system of symbols $\{\sigma_1, \sigma_2, \ldots, \sigma_k\}$ in which symbol $\sigma_i$ has a
probability of occurrence $p_i$. Naturally, $\sum_{i=1}^{k} p_i = 1$. The entropy of $\Sigma$ is defined

$$H(\Sigma) = \sum_{i=1}^{k} p_i \lg(1/p_i),$$

where $\lg x = \log_2 x$. Since the entropy does not depend on the symbols $\sigma_i$, and is
completely determined by the probabilities $p_i$, it is simpler to define the entropy in terms
of the a priori probability distribution. The entropy of the finite discrete probability
distribution $p_1, p_2, \ldots, p_k$ is

$$H(p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k} p_i \lg(1/p_i) = -\sum_{i=1}^{k} p_i \lg p_i.$$

The preceding ideas are easily extended to infinite discrete probability distributions.
Suppose $\sum_{i=1}^{\infty} p_i = 1$. We define the entropy of this distribution:

$$H(p_1, p_2, \ldots) = \sum_{i=1}^{\infty} p_i \lg(1/p_i) = -\sum_{i=1}^{\infty} p_i \lg p_i.$$

Note that $\sum_i p_i = 1$ does not guarantee the convergence of $\sum_i p_i \lg p_i$. That is, there are
probability distributions that do not have an entropy. Take, for example,
$p_i = C/(i \ln^2 i)$ for $i \ge 2$. The sum $\sum_i p_i$ converges, but the entropy $-\sum_i p_i \lg p_i$ does not.
Fortunately, these troublesome distributions do not seem to occur in practice.
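The definition of entropy for a finite distribution can be sketched in a few lines of Python. This is only an illustration: the function name `entropy` is an assumption of the sketch, not something named in the text, and terms with zero probability are taken to contribute nothing.

```python
import math

def entropy(probs):
    """Entropy in bits of a finite distribution: sum of p * lg(1/p).

    Terms with p == 0 are skipped (they contribute 0 in the limit).
    """
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Two equally likely symbols carry one bit; four carry two bits.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.25] * 4))   # 2.0
```

A distribution concentrated on one symbol has entropy 0, which matches the intuition that a fully determined system is maximally organized.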

Entropy is widely recognized as a measure of disorganization, and thus lack of structure
[Brillouin]. When organization increases, entropy decreases; when entropy
increases, structure decreases. Thus it is usually more convenient to work with
negentropy rather than entropy. The negentropy of a system is simply the negative of its
entropy. Thus, when organization increases, so does negentropy; when negentropy
decreases, so does structure. The negentropy $H$ of a discrete distribution $p_i$ is defined

$$H(p_1, p_2, \ldots) = \sum_i p_i \lg p_i.$$

In the formulas that follow, $H$ denotes negentropy, so the entropy is $-H$.


2.2 The Entropy of a Language

A language $\Sigma$ is a (usually infinite) set of strings

$$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_i, \ldots\}.$$

Now, let $P(\sigma_i)$ be the a priori probability of occurrence of the string $\sigma_i$ in $\Sigma$. The
negentropy of the language $\Sigma$ is simply

$$H(\Sigma) = \sum_i P(\sigma_i) \lg P(\sigma_i).$$

In most interesting cases the number of strings in a language is infinite. The entropy of
an infinite language is thus defined in terms of an infinite set of probabilities. Therefore
for most languages we are able to calculate the entropy only when there is some
finite description for that infinite set of probabilities, that is, when there is some
structure in that infinite set of probabilities.

Although useful languages are usually infinite (i.e., comprise an infinite number of
strings), they can be described finitely by a grammar. That is, the grammar reflects
the finite structure in the infinite set of strings. This suggests a solution to the problem
of finding a finite description of the infinity of probabilities associated with the
strings in the language.

The generation of each string in a language requires a finite number of elementary
choices to be made. For example, in a grammar for arithmetic expressions there
might be two productions for a nonterminal $v$:

$$v \to +$$
$$v \to -$$

In deriving a string from this grammar, the symbol $v$ can be replaced by either '+' or
'-'; a choice must be made. Thus, a finite sequence of choices $\pi_1, \pi_2, \ldots, \pi_k$ is
necessary to determine each string in the language. Conversely, if the grammar is
unambiguous, there is a unique such sequence for each string in the language (footnote 1).

Now suppose that each elementary choice $\pi_i$ has an a priori probability $P(\pi_i)$ of
being made. If these probabilities are independent, then the probability of the resulting
string being generated is

$$P(\pi_1) P(\pi_2) \cdots P(\pi_k).$$

Thus, associating a probability with each elementary choice permitted by the grammar
induces a probability on each string generated by the grammar. We call a grammar
with such associated probabilities an annotated grammar.
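As a small illustration of how choice probabilities induce a string probability, the following sketch multiplies the probabilities of the independent choices in one derivation. The function names and the example probabilities are hypothetical, chosen only for this sketch.

```python
import math

def string_probability(choice_probs):
    """Probability of a string: the product of its independent choice probabilities."""
    return math.prod(choice_probs)

def information_bits(choice_probs):
    """Information carried by that string, -lg of its probability, in bits."""
    return -sum(math.log2(p) for p in choice_probs)

# A hypothetical derivation that makes three choices with these probabilities.
choices = [0.5, 0.25, 0.25]
print(string_probability(choices))  # 0.03125
print(information_bits(choices))    # 5.0
```

Working with the sum of logarithms rather than the raw product avoids underflow for long derivations.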

There  is  of  course  no  guarantee  that  the  probabilities  induced  by  an  annotated 
grammar  are  in  fact  the  a  priori  occurrence  probabilities  of  the  strings  in  the 
language.  Indeed,  an  annotated  grammar  is  a  model  of  the  processes  that  in  reality 
generate  strings  in  the  language.    As  such,  it  might  or  might  not  be  a  good  model. 

We say that an annotated grammar predicts a language if it generates that language
and induces on its strings their actual a priori probabilities of occurrence. We call a
language predictable if there is an annotated grammar that predicts it. Clearly then,
we  can  determine  the  probabilities  of  the  strings  in  a  predictable  language  if  we  can 
find  an  annotated  grammar  that  predicts  that  language.  Further,  if  we  can  calculate 
the  entropy  of  the  language  generated  by  an  annotated  grammar,  then  we  will  be  able 
to  calculate  the  entropy  of  the  predictable  language.  In  the  following  sections  we 
develop  means  for  computing  entropies  from  annotated  grammars. 

1. More precisely, in an unambiguous grammar there is a unique leftmost derivation for each string.

3. Computing Negentropy from Grammars

3.1 Annotated Regular Grammars

We begin our analysis with a particularly simple class of languages: regular languages
[Ginsburg, Hopcroft & Ullman]. The advantage of beginning with them is that the
grammar for a regular language can be written as a single nonrecursive production
making use of only a few simple operators. These operators are:

name            notation        interpretation

catenation      $AB$            an $A$ followed by a $B$
alternation     $A \oplus B$    an $A$ or a $B$
Kleene star     $A^*$           zero or more $A$s
Kleene cross    $A^+$           one or more $A$s

Any regular language can be described by an expression formed from the empty string
($\varepsilon$), individual tokens and these operators, appropriately parenthesized (footnote 2). Such an
expression, which defines a regular language, is called a regular expression.

For example, the regular language of signed, nonnull digit strings is defined by the
regular expression:

$$(+ \oplus - \oplus \varepsilon)(0 \oplus 1 \oplus 2 \oplus 3 \oplus 4 \oplus 5 \oplus 6 \oplus 7 \oplus 8 \oplus 9)^+$$

This  expression  can  be  read,  "a  plus  or  a  minus  or  nothing,  followed  by  a  string  of  one 
or  more  digits." 

As  discussed  in  Section  2,  to  compute  the  entropy  of  a  language  it  is  necessary  to 
know  the  probabilities  of  the  strings  in  that  language.  If  we  have  an  annotated  gram- 
mar that  predicts  the  language  that  it  generates,  then  the  probabilities  of  these  strings 
can  be  computed  from  the  production  probabilities  (choice  probabilities)  in  the  gram- 
mar. 

In deriving a string from a regular expression there is only one situation in which a
choice can be made: from $A \oplus B$ we can derive either an $A$ or a $B$. Thus we can annotate
a regular expression by associating probabilities with all the alternands of an
alternation. We write the probabilities immediately preceding the alternands that they
are associated with:

$$pA \oplus \bar{p}B.$$

This means that we can choose an $A$ with probability $p$, or a $B$ with probability $\bar{p}$.
(Here, and throughout this paper, we use $\bar{p}$ as an abbreviation for $1 - p$.)

2. We have used '$\oplus$' instead of the usual '|', since the latter could be confused with conditional probabilities,
conditional entropies, etc. Readers unfamiliar with the regular languages and other concepts from formal
language theory should consult any standard text on the subject (e.g., Ginsburg or Hopcroft & Ullman).

Since one of the alternands must be chosen, their probabilities must add to unity.
This is the case above, since $p + \bar{p} = 1$. It also applies if there are more than two
alternands. For example, if we have

$$p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n$$

then we must have $p_1 + p_2 + \cdots + p_n = 1$.

The  following  sections  develop  entropy  formulas  that  can  be  recursively  applied  to 
any  annotated  regular  expression  to  compute  the  entropy  of  the  regular  language 
predicted  by  that  expression.  When  these  results  have  been  obtained  we  will  show  that 
they  can  be  easily  extended  to  the  computation  of  the  entropy  of  any  context-free 
language  from  a  grammar  that  predicts  it. 

3.2  Entropy  Formulas 

We  derive  a  series  of  formulas  that  can  be  applied  recursively  to  an  annotated  regular 
expression to compute the entropy of the language predicted by that expression. In all
cases  we  assume  that  the  regular  expression  is  an  unambiguous  grammar  (i.e.,  there  is 
only  one  way  to  generate  a  given  string),  and  that  the  choices  leading  to  a  given  string 
are  independent. 

We  begin  with  the  simplest  regular  expressions,  the  empty  string  and  individual 
tokens,  and  proceed  to  the  catenation  and  alternation  operations. 

Theorem: If $\varepsilon$ is the set containing just the empty string and $\tau$ the set containing
just the individual symbol $\tau$ then

$$H(\varepsilon) = H(\tau) = 0.$$

Proof: By definition $L(\varepsilon) = \{\varepsilon\}$ and $L(\tau) = \{\tau\}$. Since there is only one string in each
of these languages, its a priori probability is 1. Hence,

$$H(\varepsilon) = H(\tau) = 1 \lg 1 = 0.$$

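Theorem: $H(AB) = H(A) + H(B)$.

Proof: Suppose

$$L(A) = \{\alpha_1, \alpha_2, \ldots\}, \qquad L(B) = \{\beta_1, \beta_2, \ldots\}.$$

Then

$$L(AB) = \{\alpha_i \beta_j \mid \alpha_i \in L(A),\ \beta_j \in L(B)\}.$$

Let $P_A(\alpha_i)$ be the probability of choosing $\alpha_i$ from $L(A)$ and $P_B(\beta_j)$ the probability of
choosing $\beta_j$ from $L(B)$. Then, since we are assuming these choices are independent, the
probability of choosing $\alpha_i \beta_j$ from $L(AB)$, $P_{AB}(\alpha_i \beta_j)$, is just $P_A(\alpha_i) P_B(\beta_j)$. Now, let
$p_i = P_A(\alpha_i)$ and $q_j = P_B(\beta_j)$. Then, by factoring and distributing:

$$\begin{aligned}
H(AB) &= \sum_{i,j} P_{AB}(\alpha_i \beta_j) \lg P_{AB}(\alpha_i \beta_j) \\
      &= \sum_i \sum_j P_A(\alpha_i) P_B(\beta_j) \lg [P_A(\alpha_i) P_B(\beta_j)] \\
      &= \sum_i p_i \Big[ \sum_j q_j (\lg p_i + \lg q_j) \Big] \\
      &= \sum_i p_i \Big[ \lg p_i \sum_j q_j + \sum_j q_j \lg q_j \Big].
\end{aligned}$$

Now, since $\sum_j q_j = 1$ and $H(B) = \sum_j q_j \lg q_j$,

$$H(AB) = \sum_i p_i [\lg p_i + H(B)] = \sum_i p_i \lg p_i + H(B) \sum_i p_i.$$

Since $\sum_i p_i = 1$ and $H(A) = \sum_i p_i \lg p_i$, we have

$$H(AB) = H(A) + H(B).$$

Q.E.D.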

Definition: The n-fold catenation of $A$ with itself, $A^n$, is defined:

$$A^0 = \varepsilon, \qquad A^1 = A, \qquad A^{n+1} = A A^n \text{ for } n > 0.$$

Corollary: $H(A^n) = nH(A)$.

Proof: We prove the result inductively.

$$H(A^0) = H(\varepsilon) = 0 = 0 \cdot H(A).$$

Similarly,

$$H(A^1) = H(A) = 1 \cdot H(A).$$

Proceeding inductively for $n > 0$,

$$\begin{aligned}
H(A^{n+1}) &= H(A A^n) \\
           &= H(A) + H(A^n) \\
           &= H(A) + nH(A) \\
           &= (n+1) H(A).
\end{aligned}$$

Q.E.D.

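Theorem: $H[pA \oplus \bar{p}B] = H(p,\bar{p}) + pH(A) + \bar{p}H(B)$.

Proof: Let $G = pA \oplus \bar{p}B$. Then, to generate a string in $L(G)$ we must make a choice;
with probability $p$ we pick a string from $L(A)$, with probability $\bar{p}$ we pick a string from
$L(B)$. Let $\sigma$ be a string in $L(G)$. Since we are only considering unambiguous grammars,
$\sigma$ must have come from either $L(A)$ or $L(B)$. Suppose that $\sigma \in L(A)$. Since the
probability of a selection from $L(A)$ is $p$, and the probability of getting $\sigma$ when a selection
is made from $L(A)$ is $P_A(\sigma)$, the probability $P_G(\sigma)$ of selecting $\sigma$ from $L(G)$ is
$pP_A(\sigma)$. Similarly, if $\sigma \in L(B)$ then $P_G(\sigma) = \bar{p}P_B(\sigma)$. These observations allow us to compute
the entropy of $G$. As before, let $p_i = P_A(\alpha_i)$ and $q_j = P_B(\beta_j)$.

$$\begin{aligned}
H(G) &= \sum_{\sigma \in L(A)} P_G(\sigma) \lg P_G(\sigma) + \sum_{\sigma \in L(B)} P_G(\sigma) \lg P_G(\sigma) \\
     &= \sum_i P_G(\alpha_i) \lg P_G(\alpha_i) + \sum_j P_G(\beta_j) \lg P_G(\beta_j) \\
     &= \sum_i pP_A(\alpha_i) \lg [pP_A(\alpha_i)] + \sum_j \bar{p}P_B(\beta_j) \lg [\bar{p}P_B(\beta_j)] \\
     &= \sum_i p p_i \lg (p p_i) + \sum_j \bar{p} q_j \lg (\bar{p} q_j) \\
     &= p \sum_i p_i \lg (p p_i) + \bar{p} \sum_j q_j \lg (\bar{p} q_j) \\
     &= p \sum_i p_i (\lg p + \lg p_i) + \bar{p} \sum_j q_j (\lg \bar{p} + \lg q_j).
\end{aligned}$$

From the definitions of $H(A)$ and $H(B)$ and the fact that the $p_i$ and $q_j$ sum to 1 we get

$$\begin{aligned}
H(G) &= p[\lg p + H(A)] + \bar{p}[\lg \bar{p} + H(B)] \\
     &= p \lg p + \bar{p} \lg \bar{p} + pH(A) + \bar{p}H(B) \\
     &= H(p,\bar{p}) + pH(A) + \bar{p}H(B).
\end{aligned}$$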

This  result  is  easy  to  generalize  to  the  n-fold  alternation: 

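Theorem: The negentropy of an n-fold alternation can be computed:

$$H[p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n] = H(p_1, p_2, \ldots, p_n) + \sum_{j=1}^{n} p_j H(A_j).$$

Proof: Let $G = p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n$. For each $\alpha_{i,j} \in L(A_j)$ let $q_{i,j} = P_{A_j}(\alpha_{i,j})$. The
proof is a simple generalization of the previous:

$$\begin{aligned}
H(G) &= \sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
     &= \sum_{j=1}^{n} \sum_{\sigma \in L(A_j)} P_G(\sigma) \lg P_G(\sigma) \\
     &= \sum_{j=1}^{n} \sum_i p_j q_{i,j} \lg (p_j q_{i,j}) \\
     &= \sum_{j=1}^{n} p_j \Big[ \sum_i q_{i,j} \lg (p_j q_{i,j}) \Big] \\
     &= \sum_{j=1}^{n} p_j \Big[ \lg p_j \sum_i q_{i,j} + \sum_i q_{i,j} \lg q_{i,j} \Big].
\end{aligned}$$

Since $\sum_i q_{i,j} = 1$, $H(p_1, p_2, \ldots, p_n) = \sum_{j=1}^{n} p_j \lg p_j$ and $H(A_j) = \sum_i q_{i,j} \lg q_{i,j}$,

$$H(G) = H(p_1, p_2, \ldots, p_n) + \sum_{j=1}^{n} p_j H(A_j).$$

Q.E.D.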

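The $\oplus$ operation is associative, that is,

$$A \oplus (B \oplus C) = A \oplus B \oplus C = (A \oplus B) \oplus C.$$

We would expect the annotated version of this operation to also be associative:

$$pA \oplus \bar{p}(qB \oplus \bar{q}C) = pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C.$$

Thus, if our negentropy formula is correct, we should get the same value for the
negentropy of each of these regular expressions.

Theorem: $H[pA \oplus \bar{p}(qB \oplus \bar{q}C)] = H[pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C]$.

Proof: We derive the negentropy of the right-hand expression:

$$H[pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C] = H(p, \bar{p}q, \bar{p}\bar{q}) + pH(A) + \bar{p}qH(B) + \bar{p}\bar{q}H(C).$$

Next, we derive the negentropy of the left-hand side and show it equals the expression
above:

$$\begin{aligned}
H[pA \oplus \bar{p}(qB \oplus \bar{q}C)]
  &= H(p,\bar{p}) + pH(A) + \bar{p}H[qB \oplus \bar{q}C] \\
  &= H(p,\bar{p}) + pH(A) + \bar{p}[H(q,\bar{q}) + qH(B) + \bar{q}H(C)] \\
  &= H(p,\bar{p}) + pH(A) + \bar{p}H(q,\bar{q}) + \bar{p}qH(B) + \bar{p}\bar{q}H(C) \\
  &= H(p,\bar{p}) + \bar{p}H(q,\bar{q}) + pH(A) + \bar{p}qH(B) + \bar{p}\bar{q}H(C).
\end{aligned}$$

Thus it remains to show that

$$H(p, \bar{p}q, \bar{p}\bar{q}) = H(p,\bar{p}) + \bar{p}H(q,\bar{q}).$$

Expanding the right-hand side above, rearranging, and recalling that $q + \bar{q} = 1$, we get

$$\begin{aligned}
H(p,\bar{p}) + \bar{p}H(q,\bar{q})
  &= p \lg p + \bar{p} \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
  &= p \lg p + \bar{p}(q + \bar{q}) \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
  &= p \lg p + \bar{p}q \lg \bar{p} + \bar{p}\bar{q} \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
  &= p \lg p + \bar{p}q (\lg \bar{p} + \lg q) + \bar{p}\bar{q} (\lg \bar{p} + \lg \bar{q}) \\
  &= p \lg p + \bar{p}q \lg \bar{p}q + \bar{p}\bar{q} \lg \bar{p}\bar{q} \\
  &= H(p, \bar{p}q, \bar{p}\bar{q}).
\end{aligned}$$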

Q.E.D. 
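The identity at the heart of the associativity proof can also be checked numerically. The sketch below is an illustration only; the helper names `H` and `both_sides` are assumptions of the sketch, and `H` is the negentropy (sum of p lg p) used in this section.

```python
import math

def H(*probs):
    """Negentropy of a finite distribution: sum of p * lg p."""
    return sum(p * math.log2(p) for p in probs if p > 0)

def both_sides(p, q):
    """Return (flat, nested): H(p, p'q, p'q') and H(p, p') + p' H(q, q')."""
    pb, qb = 1 - p, 1 - q
    flat = H(p, pb * q, pb * qb)
    nested = H(p, pb) + pb * H(q, qb)
    return flat, nested

for p, q in [(0.5, 0.5), (0.2, 0.7), (0.9, 0.1)]:
    flat, nested = both_sides(p, q)
    print(p, q, round(flat, 6), round(nested, 6))
```

For each pair the two printed values agree up to floating-point rounding, as the theorem requires.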

We now consider the iterative constructs in regular grammars. The Kleene cross,
$A^+$, means one or more $A$s. Thus $A^+$ can be expanded as the infinite alternation:

$$A^+ = A \oplus A^2 \oplus A^3 \oplus \cdots$$

It can also be defined by the recursive formula:

$$A^+ = A \oplus AA^+.$$

This kind of regular grammar is converted to an annotated grammar by adding a
continuation probability $p$:

$$A_p^+ = \bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots$$

or in its recursive form

$$A_p^+ = \bar{p}A \oplus pAA_p^+.$$

We will derive the negentropy formula two ways, using both the infinite alternation and
recursive definitions, and show that we get the same result.

Theorem: $H[A_p^+] = [H(p,\bar{p}) + H(A)]/\bar{p}$.

Proof: First we use the recursive formulation:

$$A_p^+ = \bar{p}A \oplus pAA_p^+.$$

Taking the negentropy of both sides we have:

$$\begin{aligned}
H[A_p^+] &= H[\bar{p}A \oplus pAA_p^+] \\
         &= H(p,\bar{p}) + \bar{p}H(A) + pH[AA_p^+] \\
         &= H(p,\bar{p}) + \bar{p}H(A) + p[H(A) + H[A_p^+]] \\
         &= H(p,\bar{p}) + \bar{p}H(A) + pH(A) + pH[A_p^+] \\
         &= H(p,\bar{p}) + H(A) + pH[A_p^+].
\end{aligned}$$

Solving now for $H[A_p^+]$:

$$(1-p)H[A_p^+] = H(p,\bar{p}) + H(A).$$

Hence,

$$H[A_p^+] = [H(p,\bar{p}) + H(A)]/\bar{p}.$$

Q.E.D. 

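Next we compute the negentropy directly from the infinite expansion of the iteration:

$$\begin{aligned}
H[A_p^+] &= H[\bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots] \\
         &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}H(A) + p\bar{p}H(A^2) + p^2\bar{p}H(A^3) + \cdots.
\end{aligned}$$

Recalling that $H(A^n) = nH(A)$,

$$\begin{aligned}
H[A_p^+] &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}[H(A) + 2pH(A) + 3p^2H(A) + \cdots] \\
         &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}[1 + 2p + 3p^2 + \cdots]H(A).
\end{aligned}$$

Now note that if $|p| < 1$ the power series expansion of $1/\bar{p}^2$ is

$$1/\bar{p}^2 = \frac{1}{(1-p)^2} = 1 + 2p + 3p^2 + \cdots.$$

Therefore

$$\begin{aligned}
H[A_p^+] &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}(1/\bar{p}^2)H(A) \\
         &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + H(A)/\bar{p}.
\end{aligned}$$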

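It remains to simplify $H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)$.

$$\begin{aligned}
H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)
  &= \bar{p} \lg \bar{p} + p\bar{p} \lg p\bar{p} + p^2\bar{p} \lg p^2\bar{p} + \cdots \\
  &= \sum_{k=0}^{\infty} \bar{p}p^k \lg \bar{p}p^k \\
  &= \bar{p} \sum_{k=0}^{\infty} p^k \lg p^k\bar{p} \\
  &= \bar{p} \sum_{k=0}^{\infty} p^k [\lg p^k + \lg \bar{p}] \\
  &= \bar{p} \Big[ \sum_{k=0}^{\infty} p^k \lg p^k + \sum_{k=0}^{\infty} p^k \lg \bar{p} \Big].
\end{aligned}$$

Now note that if $|p| < 1$ the power series expansion of $1/\bar{p}$ is

$$1/\bar{p} = \frac{1}{1-p} = 1 + p + p^2 + p^3 + \cdots = \sum_{k=0}^{\infty} p^k.$$

Therefore,

$$\begin{aligned}
H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)
  &= \bar{p} \Big[ \sum_{k=0}^{\infty} k p^k \lg p + (1/\bar{p}) \lg \bar{p} \Big] \\
  &= \Big[ \bar{p} \lg p \sum_{k=0}^{\infty} k p^k \Big] + \lg \bar{p} \\
  &= \Big[ p\bar{p} \lg p \sum_{k=1}^{\infty} k p^{k-1} \Big] + \lg \bar{p}.
\end{aligned}$$

Using again the power series expansion for $1/\bar{p}^2$ we have

$$\begin{aligned}
H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)
  &= p\bar{p} \lg p \,(1/\bar{p}^2) + \lg \bar{p} \\
  &= (p \lg p)/\bar{p} + \lg \bar{p} \\
  &= [p \lg p + \bar{p} \lg \bar{p}]/\bar{p} \\
  &= H(p,\bar{p})/\bar{p}.
\end{aligned}$$

Therefore,

$$H[A_p^+] = H(p,\bar{p})/\bar{p} + H(A)/\bar{p} = [H(p,\bar{p}) + H(A)]/\bar{p}.$$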

Q.E.D. 
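The two derivations of the Kleene cross formula can be compared numerically: the closed form on one side, and a truncated version of the infinite alternation on the other. The sketch below is illustrative; the function names, the truncation length, and the sample value of H(A) are assumptions of the sketch.

```python
import math

def H2(p):
    """Negentropy H(p, 1-p) of a binary choice, for 0 < p < 1."""
    return p * math.log2(p) + (1 - p) * math.log2(1 - p)

def cross_closed_form(p, h_a):
    """[H(p, 1-p) + H(A)] / (1-p), the formula from the theorem."""
    return (H2(p) + h_a) / (1 - p)

def cross_truncated_sum(p, h_a, terms=5000):
    """Sum over k of w_k * lg(w_k) + w_k * (k+1) * H(A), with w_k = (1-p) * p**k."""
    pb = 1 - p
    total = 0.0
    for k in range(terms):
        w = pb * p ** k          # probability of exactly k+1 copies of A
        if w == 0.0:             # floating-point underflow: remaining terms vanish
            break
        total += w * math.log2(w) + w * (k + 1) * h_a
    return total

for p in (0.1, 0.5, 0.9):
    print(p, cross_closed_form(p, -2.0), cross_truncated_sum(p, -2.0))
```

The truncated sum approaches the closed form as more terms are included; for continuation probabilities close to 1 more terms are needed before the tail becomes negligible.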

The Kleene star, $A^*$, means zero or more repetitions. Thus it can be defined by the
infinite expansion

$$A^* = A^0 \oplus A^1 \oplus A^2 \oplus \cdots,$$

where $A^0 = \varepsilon$ and $A^1 = A$. Since the expression following the first $\oplus$ is just the definition
of $A^+$, the above equation can be written

$$A^* = \varepsilon \oplus A^+.$$

The Kleene star can also be defined recursively:

$$A^* = \varepsilon \oplus AA^*.$$

This notation is annotated by attaching a continuation probability $p$ to the star:

$$A_p^* = \bar{p}\varepsilon \oplus pAA_p^*.$$

The following theorem defines its negentropy.

Theorem: $H[A_p^*] = [H(p,\bar{p}) + pH(A)]/\bar{p}$.

Proof: There are several ways to prove this result, corresponding to the alternate
definitions of $A^*$.

(1) First we derive the negentropy of $A_p^*$ from the negentropy of $A_p^+$. Since

$$A_p^* = \bar{p}\varepsilon \oplus pA_p^+,$$

we can apply $H$ to both sides:

$$\begin{aligned}
H[A_p^*] &= H[\bar{p}\varepsilon \oplus pA_p^+] \\
         &= H(p,\bar{p}) + \bar{p}H(\varepsilon) + pH[A_p^+] \\
         &= H(p,\bar{p}) + p[H(p,\bar{p}) + H(A)]/\bar{p} \\
         &= [\bar{p}H(p,\bar{p}) + pH(p,\bar{p}) + pH(A)]/\bar{p} \\
         &= [H(p,\bar{p}) + pH(A)]/\bar{p}.
\end{aligned}$$

Q.E.D.

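(2) We can also compute the negentropy from the recursive equation

$$A_p^* = \bar{p}\varepsilon \oplus pAA_p^*.$$

We apply $H$ to both sides and get

$$\begin{aligned}
H[A_p^*] &= H[\bar{p}\varepsilon \oplus pAA_p^*] \\
         &= H(p,\bar{p}) + \bar{p}H(\varepsilon) + pH[AA_p^*] \\
         &= H(p,\bar{p}) + pH(A) + pH[A_p^*].
\end{aligned}$$

Grouping the unknowns on the left produces

$$(1-p)H[A_p^*] = H(p,\bar{p}) + pH(A).$$

Recalling that $\bar{p} = 1 - p$ we have

$$H[A_p^*] = [H(p,\bar{p}) + pH(A)]/\bar{p}.$$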

Q.E.D. 

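(3) Finally, we derive the negentropy formula from the infinite expansion

$$A_p^* = \bar{p}\varepsilon \oplus p\bar{p}A \oplus p^2\bar{p}A^2 \oplus \cdots$$

Take the negentropy of both sides to get

$$\begin{aligned}
H[A_p^*] &= H[\bar{p}\varepsilon \oplus p\bar{p}A \oplus p^2\bar{p}A^2 \oplus p^3\bar{p}A^3 \oplus \cdots] \\
         &= H(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + p\bar{p}H(A) + p^2\bar{p}H(A^2) + p^3\bar{p}H(A^3) + \cdots \\
         &= H(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + p\bar{p}(1 + 2p + 3p^2 + \cdots)H(A) \\
         &= H(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + (p/\bar{p})H(A),
\end{aligned}$$

where we have used $1/\bar{p}^2 = 1 + 2p + 3p^2 + \cdots$. We have already shown in the derivation
of $H[A_p^+]$ that

$$H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) = H(p,\bar{p})/\bar{p}.$$

Therefore we have

$$\begin{aligned}
H[A_p^*] &= H(p,\bar{p})/\bar{p} + (p/\bar{p})H(A) \\
         &= [H(p,\bar{p}) + pH(A)]/\bar{p}.
\end{aligned}$$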

Q.E.D. 

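We can check these results by computing the negentropy of $A_p^+$ based on the equation:

$$A_p^+ = AA_p^*.$$

Applying $H$ to both sides we derive

$$\begin{aligned}
H[A_p^+] &= H[AA_p^*] \\
         &= H(A) + H[A_p^*] \\
         &= H(A) + [H(p,\bar{p}) + pH(A)]/\bar{p} \\
         &= [H(p,\bar{p}) + \bar{p}H(A) + pH(A)]/\bar{p} \\
         &= [H(p,\bar{p}) + H(A)]/\bar{p},
\end{aligned}$$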

which  checks  with  our  previous  result. 

The   formulas   for   computing    the    negentropy   (and   hence   entropy)    of   a   regular 
language  are  summarized  in  Table  1. 

TABLE 1. Formulas for Negentropy of Regular Languages

$H[\varepsilon]$               =   $0$
$H[\tau]$                      =   $0$
$H[AB]$                        =   $H(A) + H(B)$
$H[pA \oplus \bar{p}B]$        =   $H(p,\bar{p}) + pH(A) + \bar{p}H(B)$
$H[A_p^+]$                     =   $[H(p,\bar{p}) + H(A)]/\bar{p}$
$H[A_p^*]$                     =   $[H(p,\bar{p}) + pH(A)]/\bar{p}$
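Because the formulas of Table 1 apply recursively, they translate directly into a small evaluator over the structure of an annotated regular expression. The sketch below is one possible encoding, not the paper's: the class names (`Eps`, `Tok`, `Cat`, `Alt`, `Plus`, `Star`) and the probabilities attached to the signed-digit-string example are assumptions made for illustration, and the expression is assumed to be unambiguous.

```python
import math
from dataclasses import dataclass

def H(*probs):
    """Negentropy of a finite distribution: sum of p * lg p."""
    return sum(p * math.log2(p) for p in probs if p > 0)

@dataclass(frozen=True)
class Eps:            # the empty string
    pass

@dataclass(frozen=True)
class Tok:            # a single token
    name: str

@dataclass(frozen=True)
class Cat:            # catenation of sub-expressions
    parts: tuple

@dataclass(frozen=True)
class Alt:            # annotated alternation: ((prob, expr), ...)
    branches: tuple

@dataclass(frozen=True)
class Plus:           # Kleene cross with continuation probability p
    p: float
    body: object

@dataclass(frozen=True)
class Star:           # Kleene star with continuation probability p
    p: float
    body: object

def negentropy(e):
    """Apply the Table 1 formulas recursively to an annotated expression."""
    if isinstance(e, (Eps, Tok)):
        return 0.0
    if isinstance(e, Cat):
        return sum(negentropy(x) for x in e.parts)
    if isinstance(e, Alt):
        probs = [p for p, _ in e.branches]
        if not math.isclose(sum(probs), 1.0):
            raise ValueError("alternand probabilities must sum to 1")
        return H(*probs) + sum(p * negentropy(x) for p, x in e.branches)
    if isinstance(e, Plus):
        return (H(e.p, 1 - e.p) + negentropy(e.body)) / (1 - e.p)
    if isinstance(e, Star):
        return (H(e.p, 1 - e.p) + e.p * negentropy(e.body)) / (1 - e.p)
    raise TypeError(f"unknown expression: {e!r}")

# Signed digit strings, with made-up probabilities for the example.
sign = Alt(((0.25, Tok("+")), (0.25, Tok("-")), (0.5, Eps())))
digit = Alt(tuple((0.1, Tok(str(d))) for d in range(10)))
signed_digits = Cat((sign, Plus(0.5, digit)))
print(-negentropy(signed_digits))   # entropy in bits, about 10.14
```

The n-fold alternation formula is used for `Alt`, so the binary case of Table 1 is covered as the special case of two branches.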

3.3  Examples 

In this section we illustrate the application of our negentropy formulas with several
simple examples. Several of these examples are based on free languages:

Definition: The free language on an alphabet $T$ is the set of all finite strings (including
the empty string) of elements of $T$.

Thus $T^*$ is the free language on $T$. In most cases it does not matter what the alphabet
$T$ is, so we speak of the free language on $n$ symbols. Let $T_n$ represent an alphabet
of $n$ symbols:

$$T_n = \tau_1 \oplus \tau_2 \oplus \cdots \oplus \tau_n.$$

Then $F_n$, the free language on $n$ symbols, is defined

$$F_n = T_n^* = (\tau_1 \oplus \cdots \oplus \tau_n)^*.$$

Of course, before we can compute the entropy of a language we must annotate its
grammar with probabilities. Therefore, the annotated grammar for the free language
on $n$ symbols is

$$F_n = (q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)_p^*.$$

Theorem: The negentropy of the free language on $n$ symbols,

$$F_n = (q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)_p^*,$$

is

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.$$

Proof: We simply apply the formulas from Table 1:

$$\begin{aligned}
H(F_n) &= H[(q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)_p^*] \\
       &= [H(p,\bar{p}) + pH[q_1\tau_1 \oplus \cdots \oplus q_n\tau_n]]/\bar{p} \\
       &= [H(p,\bar{p}) + p\{H(q_1, \ldots, q_n) + q_1H(\tau_1) + \cdots + q_nH(\tau_n)\}]/\bar{p} \\
       &= [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.
\end{aligned}$$

Q.E.D.

The free language on one symbol $\tau$ is just the set of all strings of $\tau$s:

$$L(F_1) = \{\varepsilon, \tau, \tau\tau, \tau\tau\tau, \ldots\}.$$

The following theorem defines the negentropy of $F_1$.

Corollary: The negentropy of the free language with continuation probability $p$ on
one symbol is:

$$H[\tau_p^*] = H(p,\bar{p})/\bar{p}.$$

Proof: We simply use the previous theorem with $n = 1$.

Corollary: The negentropy of the free language with continuation probability $p$ on an
alphabet of $n$ equally likely symbols is

$$H(F_n) = [H(p,\bar{p}) - p \lg n]/\bar{p}.$$

Proof: To derive this simply set $q_i = 1/n$ in the negentropy formula for $F_n$:

$$\begin{aligned}
H(F_n) &= [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p} \\
       &= [H(p,\bar{p}) + pH(1/n, \ldots, 1/n)]/\bar{p} \\
       &= \Big[ H(p,\bar{p}) + p \sum_{i=1}^{n} (1/n) \lg (1/n) \Big] / \bar{p} \\
       &= [H(p,\bar{p}) + p \lg (1/n)]/\bar{p} \\
       &= [H(p,\bar{p}) - p \lg n]/\bar{p}.
\end{aligned}$$

Q.E.D. 


Table  2  shows  the  entropies  of  free  languages  on  equally  likely  symbols  for  several 
different  continuation  probabilities. 


TABLE 2. Entropies of Free Languages on Equally Likely Symbols

p \ n        2        4        8       10       12       64      256

0.1       0.63     0.74     0.85     0.89     0.92     1.19     1.41
0.2       1.15     1.40     1.65     1.73     1.80     2.40     2.90
0.3       1.69     2.12     2.54     2.68     2.80     3.83     4.69
0.4       2.28     2.95     3.62     3.83     4.01     5.62     6.95
0.5       3.00     4.00     5.00     5.32     5.58     8.00    10.00
0.6       3.93     5.43     6.93     7.41     7.80    11.43    14.43
0.7       5.27     7.60     9.94    10.69    11.30    16.94    21.60
0.8       7.61    11.61    15.61    16.90    17.95    27.61    35.61
0.9      13.69    22.69    31.69    34.59    36.95    58.69    76.69
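The entries of Table 2 follow from the preceding corollary with the sign reversed (entropy rather than negentropy). The following sketch recomputes them; the function name `free_language_entropy` and the output layout are assumptions of this illustration.

```python
import math

def free_language_entropy(p, n):
    """Entropy in bits of the free language on n equally likely symbols,
    with continuation probability p (0 < p < 1): [h(p) + p lg n] / (1 - p),
    where h(p) = -H(p, 1-p) is the binary entropy."""
    binary_entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return (binary_entropy + p * math.log2(n)) / (1 - p)

sizes = [2, 4, 8, 10, 12, 64, 256]
print("p\\n  " + "".join(f"{n:>8}" for n in sizes))
for tenth in range(1, 10):
    p = tenth / 10
    row = "".join(f"{free_language_entropy(p, n):8.2f}" for n in sizes)
    print(f"{p:<5}" + row)
```

Entries printed by the sketch may differ from a published table in the last digit where a value falls close to a rounding boundary.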

This table suggests that we consider the special case in which $n$ is a power of two and $p$
is one half. This leads to:

Corollary: The negentropy of the free language with continuation probability one
half and $n = 2^k$ equally likely symbols is $-k - 2$, that is, $-\lg n - 2$. Conversely, the
entropy is $k + 2$.

Proof: Let $p = 1/2$ and $n = 2^k$ in the formula from the previous corollary and we have

$$\begin{aligned}
H &= [H(p,\bar{p}) - p \lg n]/\bar{p} \\
  &= 2[\tfrac{1}{2} \lg \tfrac{1}{2} + \tfrac{1}{2} \lg \tfrac{1}{2} - k/2] \\
  &= 2 \lg \tfrac{1}{2} - k \\
  &= -2 - k.
\end{aligned}$$
3.4 Computing the Entropy of a Context-Free Grammar

In  this  section  we  extend  the  results  of  the  previous  sections  to  the  computation  of  the 
entropy  of  any  context-free  grammar. 

As usual, we define a context-free grammar $G$ to be a quadruple,

$$G = \langle T, N, P, v_0 \rangle,$$

in which $T$ is a finite set of terminal symbols, $N$ is a finite set of nonterminal symbols,
$v_0 \in N$ is the goal symbol, and $P$ is a finite set of productions,

$$P \subset N \times (T \cup N)^*.$$

That is, each production is a pair of the form $\langle v, \alpha \rangle$, in which $v$ is a nonterminal and $\alpha$
is a finite string of terminals and nonterminals. Such a production is usually written
'$v \to \alpha$'. The Backus-Naur form (BNF) of a context-free grammar combines all the
productions for a given nonterminal into a single production. For example, if a context-free
grammar contains the following productions for $v$:

$$\begin{aligned}
v &\to \alpha_1 \\
  &\ \,\vdots \\
v &\to \alpha_n
\end{aligned}$$

then the BNF form of this grammar combines them into a single production:

$$v \to \alpha_1 \oplus \alpha_2 \oplus \cdots \oplus \alpha_n.$$

In the following discussion we will usually use the BNF form of grammars.

The characteristic that distinguishes context-free grammars from regular grammars
is that the productions of a context-free grammar can be mutually recursive.
That is, a nonterminal $v$ can be defined in terms of a string that is, directly or
indirectly, defined in terms of $v$. It is well known, however, (see Ginsburg) that each
production in a BNF grammar can be considered an equation on sets of strings. If we
recursively define $L(\alpha)$, the language defined by $\alpha$, as follows:

$$\begin{aligned}
L(\varepsilon) &= \{\varepsilon\} \\
L(\tau) &= \{\tau\} \\
L(\alpha\beta) &= L(\alpha) \times L(\beta), \quad \text{where } S \times T = \{\alpha\beta \mid \alpha \in S,\ \beta \in T\} \\
L(\alpha \oplus \beta) &= L(\alpha) \cup L(\beta)
\end{aligned}$$

then each production $v \to \alpha$ of a context-free grammar $G$ can be transformed into a
corresponding equation

$$L(v) = L(\alpha).$$

Let $G = \langle T, N, P, v_0 \rangle$ be a context-free grammar, in which

$$P = \{v_0 \to \alpha_0,\ v_1 \to \alpha_1,\ \ldots,\ v_k \to \alpha_k\}$$

is a set of productions in BNF form. Then corresponding to $P$ is a collection of
simultaneous equations on sets of strings:

$$\begin{aligned}
L(v_0) &= L(\alpha_0) \\
L(v_1) &= L(\alpha_1) \\
       &\ \,\vdots \\
L(v_k) &= L(\alpha_k)
\end{aligned}$$

The solution to this set of equations defines the language generated by the grammar:

$$L(G) = L(v_0).$$

Context-free grammars can be annotated with production probabilities in the same
way as regular grammars. Formally, we define an annotated context-free grammar $G$
to be a quadruple $\langle T, N, P, v_0 \rangle$, in which $v_0 \in N$ and

$$P \subset \mathbb{R} \times N \times (T \cup N)^*.$$

Thus each production is a triple $\langle p, v, \alpha \rangle$, $p$ being a real number representing the
probability of applying the production $v \to \alpha$. We impose the restriction that all the
probabilities associated with the productions for a given nonterminal must sum to unity:

$$\sum_{\langle p_i, v, \alpha_i \rangle \in P} p_i = 1, \quad \text{for } v \in N.$$

This is simpler to see in the BNF form of an annotated context-free grammar. In any
production that is an alternation,

$$v \to p_1\alpha_1 \oplus p_2\alpha_2 \oplus \cdots \oplus p_n\alpha_n,$$

we must have that

$$\sum_{i=1}^{n} p_i = 1.$$

Consider an unambiguous annotated context-free grammar $G$ and let $\Sigma = L(G)$ be the
language generated by $G$. Let $P_G(\sigma)$ be the probability that a string $\sigma$ is generated by
$G$. We say that $\Sigma$ is predicted by $G$ if for every string $\sigma$, $P_\Sigma(\sigma) = P_G(\sigma)$, that is, the
observed probability of occurrence of $\sigma$ is the same as the probability of its generation
by $G$. We now consider how we might compute the negentropy of $\Sigma$ from $G$.

Consider a production $v \to \alpha$ in the annotated grammar; this corresponds to an equation
$L(v) = L(\alpha)$. Since $v \to \alpha$, the probability of a string being generated from $v$ is the
same as its probability of being generated from $\alpha$. Thus, $P_v(\sigma) = P_\alpha(\sigma)$, for all
$\sigma \in L(v) = L(\alpha)$. Thus, the negentropy of $L(v)$, which we can write $H(v)$, is the same as
the negentropy of $L(\alpha)$, which we can write $H(\alpha)$. That is,

$$H(v) = H(\alpha).$$

It can be seen that corresponding to the BNF productions $v_i \to \alpha_i$ in $P$ there is a set of
simultaneous equations

$$\begin{aligned}
H(v_0) &= H(\alpha_0) \\
H(v_1) &= H(\alpha_1) \\
       &\ \,\vdots \\
H(v_k) &= H(\alpha_k)
\end{aligned}$$

that can be solved to yield the negentropy of the language predicted by the grammar.
In particular,

$$H(\Sigma) = H(G) = H(v_0).$$

We have already made use of this technique in applying the recursive definitions of $A_p^*$
and $A_p^+$ to solve for their negentropy. In summary, the methods developed previously
for  computing  the  negentropies  of  regular  languages  can  be  extended  in  the  obvious 
way  to  context-free  languages. 
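As a sketch of how such simultaneous equations can be solved, consider a hypothetical annotated grammar for balanced parentheses, which does not appear in the text: $v \to p\,(\,v\,)\,v \oplus \bar{p}\,\varepsilon$. Applying the formulas of Table 1, and noting that tokens and $\varepsilon$ contribute 0, gives the single equation $H(v) = H(p,\bar{p}) + 2pH(v)$, so $H(v) = H(p,\bar{p})/(1-2p)$. The code below compares this closed form with repeated substitution. It assumes $p < 1/2$, where the substitution converges; the function names are inventions of the sketch.

```python
import math

def H2(p):
    """Negentropy H(p, 1-p) of a binary choice, for 0 < p < 1."""
    return p * math.log2(p) + (1 - p) * math.log2(1 - p)

def dyck_negentropy_closed_form(p):
    """Solve H(v) = H(p, 1-p) + 2 p H(v) algebraically (assumes p < 1/2)."""
    return H2(p) / (1 - 2 * p)

def dyck_negentropy_iterated(p, steps=500):
    """Solve the same equation by repeated substitution, starting from 0."""
    h = 0.0
    for _ in range(steps):
        h = H2(p) + 2 * p * h
    return h

for p in (0.1, 0.25, 0.4):
    print(p, dyck_negentropy_closed_form(p), dyck_negentropy_iterated(p))
```

For grammars with several nonterminals the same approach yields a system of linear equations in the unknowns $H(v_i)$, which any standard linear solver can handle.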

4. Determining the Production Probabilities

4.1 Measurable Properties

To  compute  any  specific  entropies  we  need  to  know  the  probabilities  of  applying  the 
productions  in  the  appropriate  grammar  for  the  language.  These  can  be  obtained  by 
determining  measurable  parameters  whose  values  are  implied  by  the  production  pro- 
babilities. That  is,  the  measurable  properties  are  a  function  of  the  production  proba- 
bilities. The  production  probabilities  can  then  be  determined  by  (analytically  or 
numerically) inverting this function.

What measurable properties should we use? One of the simplest is the density of
occurrence of a token. Let $\mathrm{Occ}_\tau(\sigma)$ be the number of occurrences of the symbol $\tau$ in
the string $\sigma$. This is formally defined:

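$$\begin{aligned}
\mathrm{Occ}_\tau(\varepsilon) &= 0, \\
\mathrm{Occ}_\tau(\tau) &= 1, \\
\mathrm{Occ}_\tau(\tau') &= 0, \quad \text{for } \tau \neq \tau', \\
\mathrm{Occ}_\tau(\sigma) &= \mathrm{Occ}_\tau(\alpha) + \mathrm{Occ}_\tau(\beta), \quad \text{for } \sigma = \alpha\beta.
\end{aligned}$$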

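The density of occurrence of $\tau$ in the language generated by $G$, written $\Delta_\tau(G)$, is then

$$\Delta_\tau(G) = \frac{\sum_{\sigma \in L(G)} P_G(\sigma)\,\mathrm{Occ}_\tau(\sigma)}{\sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma|},$$

where $P_G(\sigma)$ is the predicted probability of generation of $\sigma$ and $|\sigma|$ is the length of $\sigma$. If
$G$ predicts $L(G)$, then $\Delta_\tau(G)$ will be the observed density of occurrence of $\tau$ in the
language generated by $G$.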

The formula for $\Delta_\tau(G)$ suggests two useful properties of a grammar: the average
length of the strings it generates and the average number of occurrences of a token in
a string. We let $\Lambda(G)$ be the predicted average length of the strings generated by $G$:

$$\Lambda(G) = \sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma|.$$

We let $\Phi_\tau(G)$ be the predicted frequency of occurrence of $\tau$ in the strings generated by
$G$:

$$\Phi_\tau(G) = \sum_{\sigma \in L(G)} P_G(\sigma)\,\mathrm{Occ}_\tau(\sigma).$$

It then follows that

$$\Delta_\tau(G) = \Phi_\tau(G) / \Lambda(G).$$

The goal then is to find ways to compute $\Phi_\tau(G)$ and $\Lambda(G)$ from $G$. This will permit us to
calculate predicted values of $\Delta_\tau$, which can be compared with actual measurements.
Therefore, the next two sections present means for calculating $\Lambda(G)$ and $\Phi_\tau(G)$.
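
As a concrete check of these definitions, the following Python sketch evaluates the
average length, the token frequency and the token density for a small finite language
given directly as a table of string probabilities. The three-string distribution and the
function names are illustrative assumptions, not taken from the paper; each character
of a string is treated as one token.

```python
from fractions import Fraction

def average_length(dist):
    """Lambda: sum over strings of P(s) * |s|."""
    return sum(p * len(s) for s, p in dist.items())

def token_frequency(dist, token):
    """Phi_tau: sum over strings of P(s) * Occ_tau(s)."""
    return sum(p * s.count(token) for s, p in dist.items())

def token_density(dist, token):
    """Delta_tau = Phi_tau / Lambda (a ratio of averages)."""
    return token_frequency(dist, token) / average_length(dist)

# Hypothetical language {a, ab, abb} with assumed probabilities.
dist = {"a": Fraction(1, 2), "ab": Fraction(1, 4), "abb": Fraction(1, 4)}

print(average_length(dist))        # 7/4
print(token_frequency(dist, "b"))  # 3/4
print(token_density(dist, "b"))    # 3/7
```

Note that the density is a ratio of two averages, not the average of the per-string
ratios; the densities of all tokens therefore sum to one.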
4.2  Average String Length

We begin again with regular grammars.

Theorem: The average lengths of the empty and single token grammars are:

$$\begin{aligned}
\Lambda(\varepsilon) &= 0 \text{ tokens}, \\
\Lambda(\tau) &= 1 \text{ token}.
\end{aligned}$$

Proof: Obvious.

For  the  remaining  derivations  we  need  some  notation.    Suppose  that 

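$$\begin{aligned}
L(A) &= \{\alpha_1, \alpha_2, \ldots\}, \\
L(B) &= \{\beta_1, \beta_2, \ldots\}, \\
a_i &= P_A(\alpha_i), \\
b_j &= P_B(\beta_j).
\end{aligned}$$

Then it follows that

$$\begin{aligned}
\Lambda(A) &= \sum_i a_i\,|\alpha_i|, \\
\Lambda(B) &= \sum_j b_j\,|\beta_j|.
\end{aligned}$$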

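Theorem: The average length of the catenation of two grammars is the sum of their
average lengths:

$$\Lambda(AB) = \Lambda(A) + \Lambda(B).$$

Proof: Note that if $\sigma \in L(AB)$ then $\sigma = \alpha_i\beta_j$ for some $\alpha_i \in L(A)$, $\beta_j \in L(B)$. Assuming as
usual that the choices from $A$ and $B$ are independent,

$$P_{AB}(\sigma) = P_A(\alpha_i)\,P_B(\beta_j) = a_i b_j.$$

We now derive the average length:

$$\begin{aligned}
\Lambda(AB) &= \sum_i \sum_j P_{AB}(\alpha_i\beta_j)\,|\alpha_i\beta_j| \\
&= \sum_i \sum_j a_i b_j\,(|\alpha_i| + |\beta_j|) \\
&= \sum_i a_i \Big[\,|\alpha_i| \sum_j b_j + \sum_j b_j\,|\beta_j|\,\Big] \\
&= \sum_i a_i\,[\,|\alpha_i| + \Lambda(B)\,] \\
&= \sum_i a_i\,|\alpha_i| + \Lambda(B) \sum_i a_i \\
&= \Lambda(A) + \Lambda(B).
\end{aligned}$$

Q.E.D.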


Corollary: $\Lambda(A^n) = n\Lambda(A)$.

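Theorem: The average length of an alternation is the weighted average of the average
lengths of the alternands:

$$\Lambda(pA \oplus \bar{p}B) = p\Lambda(A) + \bar{p}\Lambda(B).$$

Proof: Let $G = pA \oplus \bar{p}B$. Recall that if $\sigma \in L(G)$ then either $\sigma \in L(A)$ or $\sigma \in L(B)$, and that
the choice from $A$ is made with probability $p$. Therefore

$$\begin{aligned}
P_G(\sigma) &= pP_A(\sigma), \quad \text{if } \sigma \in L(A), \\
P_G(\sigma) &= \bar{p}P_B(\sigma), \quad \text{if } \sigma \in L(B).
\end{aligned}$$

Then we derive:

$$\begin{aligned}
\Lambda(G) &= \sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma| \\
&= \sum_{\sigma \in L(A)} pP_A(\sigma)\,|\sigma| + \sum_{\sigma \in L(B)} \bar{p}P_B(\sigma)\,|\sigma| \\
&= p\Lambda(A) + \bar{p}\Lambda(B).
\end{aligned}$$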


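Q.E.D.

Theorem: $\Lambda(A^{p+}) = \Lambda(A)/\bar{p}$.

Proof: We appeal to the infinite expansion:

$$A^{p+} = \bar{p}A \oplus \bar{p}pA^2 \oplus \bar{p}p^2A^3 \oplus \cdots$$

Applying $\Lambda$ to both sides:

$$\begin{aligned}
\Lambda(A^{p+}) &= \bar{p}\Lambda(A) + \bar{p}p\Lambda(A^2) + \bar{p}p^2\Lambda(A^3) + \cdots \\
&= \bar{p}\,[1 + 2p + 3p^2 + \cdots]\,\Lambda(A) \\
&= \bar{p}\,[1/\bar{p}^2]\,\Lambda(A) \\
&= \Lambda(A)/\bar{p}.
\end{aligned}$$

Q.E.D.    Alternately, we can appeal to the recursive definition:

$$\begin{aligned}
\Lambda(A^{p+}) &= \bar{p}\Lambda(A) + p\Lambda(AA^{p+}) \\
&= \bar{p}\Lambda(A) + p\Lambda(A) + p\Lambda(A^{p+}).
\end{aligned}$$

Grouping like terms gives

$$(1-p)\Lambda(A^{p+}) = (\bar{p} + p)\Lambda(A),$$

which leads directly to the result.    Q.E.D.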

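Theorem: $\Lambda(A^{p*}) = (p/\bar{p})\Lambda(A)$.

Proof: We apply $\Lambda$ to the infinite expansion of $A^{p*}$:

$$\begin{aligned}
\Lambda(A^{p*}) &= \bar{p}\Lambda(\varepsilon) + \bar{p}p\Lambda(A) + \bar{p}p^2\Lambda(A^2) + \bar{p}p^3\Lambda(A^3) + \cdots \\
&= \bar{p}\cdot 0 + \bar{p}p\Lambda(A) + \bar{p}p^2\cdot 2\Lambda(A) + \bar{p}p^3\cdot 3\Lambda(A) + \cdots \\
&= \bar{p}p\,(1 + 2p + 3p^2 + \cdots)\,\Lambda(A) \\
&= \bar{p}p\,(1/\bar{p}^2)\,\Lambda(A) \\
&= (p/\bar{p})\Lambda(A).
\end{aligned}$$

Q.E.D.

Alternately, we can apply $\Lambda$ to the recursive definition:

$$\begin{aligned}
\Lambda(A^{p*}) &= \bar{p}\Lambda(\varepsilon) + p\Lambda(AA^{p*}) \\
&= \bar{p}\cdot 0 + p\Lambda(A) + p\Lambda(A^{p*}).
\end{aligned}$$

Solving for $\Lambda(A^{p*})$ we get

$$\Lambda(A^{p*}) = (p/\bar{p})\Lambda(A).$$

Q.E.D.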

We consider some simple examples based on free languages.

Theorem: Consider the free language on $n$ symbols generated by:

$$F_n = (q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)^{p*}.$$

The average length of the strings of this language is

$$\lambda = \Lambda(F_n) = p/\bar{p} \text{ tokens}.$$

Proof: Since $\Lambda(\tau_i) = 1$ and $q_1 + \cdots + q_n = 1$,

$$\begin{aligned}
\Lambda(F_n) &= (p/\bar{p})\,[q_1\Lambda(\tau_1) + \cdots + q_n\Lambda(\tau_n)] \\
&= (p/\bar{p})\,[q_1 + \cdots + q_n] \\
&= p/\bar{p}.
\end{aligned}$$

Q.E.D.

Notice that the average length of a free language is independent of the number of
symbols in the alphabet. This is to be expected.

Corollary: The average length of a free language with continuation probability $\tfrac12$ is 1
token.

Proof: Apply the previous theorem with $p = \bar{p} = \tfrac12$.    Q.E.D.


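TABLE 3.   Average String Length for Regular Grammars

$$\begin{aligned}
\Lambda(\varepsilon) &= 0 \text{ tokens} \\
\Lambda(\tau) &= 1 \text{ token} \\
\Lambda(AB) &= \Lambda(A) + \Lambda(B) \\
\Lambda(pA \oplus \bar{p}B) &= p\Lambda(A) + \bar{p}\Lambda(B) \\
\Lambda(A^{p+}) &= \Lambda(A)/\bar{p} \\
\Lambda(A^{p*}) &= (p/\bar{p})\Lambda(A)
\end{aligned}$$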

The  formulas  for  computing  average  string  length  are  summarized  in  Table  3. 
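
The rules of Table 3 translate directly into a recursive evaluator. The Python sketch
below is one possible encoding, assuming grammar terms are written as tagged tuples;
the tuple tags and the function name `avg_len` are illustrative choices, not notation
from the paper.

```python
from fractions import Fraction

# Grammar terms as tagged tuples (an illustrative encoding):
#   ("eps",)           empty string
#   ("tok", name)      single token
#   ("cat", A, B)      catenation AB
#   ("alt", p, A, B)   alternation pA (+) (1-p)B
#   ("plus", p, A)     A^{p+}, continuation probability p
#   ("star", p, A)     A^{p*}, continuation probability p

def avg_len(g):
    """Average string length of a regular grammar term, per Table 3."""
    kind = g[0]
    if kind == "eps":
        return Fraction(0)
    if kind == "tok":
        return Fraction(1)
    if kind == "cat":
        return avg_len(g[1]) + avg_len(g[2])
    if kind == "alt":
        _, p, a, b = g
        return p * avg_len(a) + (1 - p) * avg_len(b)
    if kind == "plus":
        _, p, a = g
        return avg_len(a) / (1 - p)
    if kind == "star":
        _, p, a = g
        return p * avg_len(a) / (1 - p)
    raise ValueError(f"unknown term: {kind!r}")

half = Fraction(1, 2)
letter = ("alt", Fraction(1, 3), ("tok", "a"), ("tok", "b"))
free2 = ("star", half, letter)               # free language on two symbols
nonempty = ("plus", Fraction(3, 4), letter)  # nonempty strings on two symbols
pair = ("cat", free2, nonempty)

print(avg_len(free2), avg_len(nonempty), avg_len(pair))  # 1 4 5
```

Exact rationals are used so that the results can be compared with the closed forms
without rounding error.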
4.3  Average Token Frequency

The formulas for computing average token frequency are almost identical to those for
average length. For this reason the proofs are omitted and the results are shown in
Table 4.


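TABLE 4.   Average Token Frequency for Regular Grammars

$$\begin{aligned}
\Phi_\tau(\varepsilon) &= 0 \\
\Phi_\tau(\tau) &= 1 \\
\Phi_\tau(\tau') &= 0, \quad \text{for } \tau \neq \tau' \\
\Phi_\tau(AB) &= \Phi_\tau(A) + \Phi_\tau(B) \\
\Phi_\tau(pA \oplus \bar{p}B) &= p\Phi_\tau(A) + \bar{p}\Phi_\tau(B) \\
\Phi_\tau(A^{p+}) &= \Phi_\tau(A)/\bar{p} \\
\Phi_\tau(A^{p*}) &= (p/\bar{p})\Phi_\tau(A)
\end{aligned}$$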

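Theorem: Consider the free language on $n$ symbols:

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{p*}.$$

The frequency of occurrence of token $\tau_i$ is

$$\varphi_i = \Phi_{\tau_i}(F_n) = q_i\,p/\bar{p}.$$

Proof: We derive as follows, abbreviating $\Phi_{\tau_i}$ by $\Phi_i$:

$$\begin{aligned}
\Phi_i(F_n) &= (p/\bar{p})\,\Phi_i[q_1\tau_1 \oplus \cdots \oplus q_n\tau_n] \\
&= (p/\bar{p})\,[q_1\Phi_i(\tau_1) + \cdots + q_i\Phi_i(\tau_i) + \cdots + q_n\Phi_i(\tau_n)] \\
&= (p/\bar{p})\,[q_1 \cdot 0 + \cdots + q_i \cdot 1 + \cdots + q_n \cdot 0] \\
&= (p/\bar{p})\,q_i.
\end{aligned}$$

Q.E.D.

Corollary: In the free language of the previous theorem, the density of occurrence of
symbol $\tau_i$ is $q_i$. We denote this measurable property $\delta_i$.

Proof: Since $\Delta_i(F_n) = \Phi_i(F_n)/\Lambda(F_n)$, we have

$$\delta_i = \frac{q_i\,p/\bar{p}}{p/\bar{p}} = q_i.$$

Q.E.D.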

The following corollary shows that for the case of a free language it is easy to
compute the production probabilities from measurable properties of strings in the
language.

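Corollary: If we measure the properties $\delta_1, \delta_2, \ldots, \delta_n$ and $\lambda$ of a free language on $n$
symbols, then we can compute the probabilities $q_1, q_2, \ldots, q_n$ and $p$ from them by the
formulas:

$$\begin{aligned}
q_i &= \delta_i, \\
p &= \lambda/(\lambda+1).
\end{aligned}$$

Proof: The formulas for $q_i$ are obvious. For $p$ we know that

$$\lambda = p/\bar{p} = p/(1-p).$$

Therefore $\lambda - p\lambda = p$, so $\lambda = p + p\lambda$. Thus $\lambda = p(\lambda+1)$, so $p = \lambda/(\lambda+1)$.    Q.E.D.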

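Corollary: The negentropy of a free language exhibiting occurrence densities $\delta_1, \ldots,
\delta_n$ and average string length $\lambda$ is:

$$H_n = H(\lambda, \lambda+1) + \lambda H(\delta_1, \ldots, \delta_n).$$

Proof: To derive this result we take the formula for the negentropy of a free language,

$$H_n = H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p},$$

and substitute the values for $q_i$ and $p$ derived in the previous corollary. To do this,
note that

$$\bar{p} = 1 - p = 1 - \frac{\lambda}{\lambda+1} = \frac{1}{\lambda+1}.$$

Then we have

$$\begin{aligned}
H_n &= (\lambda+1)\left[ H\!\left(\frac{\lambda}{\lambda+1}, \frac{1}{\lambda+1}\right) + \frac{\lambda}{\lambda+1}\,H(\delta_1, \ldots, \delta_n) \right] \\
&= (\lambda+1)\left[ \frac{\lambda}{\lambda+1} \lg \frac{\lambda}{\lambda+1} + \frac{1}{\lambda+1} \lg \frac{1}{\lambda+1} \right] + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg \frac{\lambda}{\lambda+1} - \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg \lambda - \lambda \lg(\lambda+1) - \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg \lambda - (\lambda+1) \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= H(\lambda, \lambda+1) + \lambda H(\delta_1, \ldots, \delta_n).
\end{aligned}$$

Q.E.D.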

Thus  we  have  the  negentropy  (and  hence  entropy)  of  a  language  expressed  entirely 
in  measurable  parameters. 
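
The estimation procedure of this section can be sketched in Python as follows. The
sample of strings is made up for illustration, the function names are assumptions, and
each character is treated as one token. The code uses the positive (entropy) sign
convention, so its closed form in $\lambda$ is the negation of the negentropy expression
derived above.

```python
from collections import Counter
from math import log2

def entropy(probs):
    """Shannon entropy in bits (positive sign convention)."""
    return -sum(x * log2(x) for x in probs if x > 0)

def estimate_parameters(sample):
    """Return (lam, delta, p) measured from a list of observed strings."""
    total_tokens = sum(len(s) for s in sample)
    lam = total_tokens / len(sample)            # average string length
    counts = Counter("".join(sample))
    delta = {t: c / total_tokens for t, c in counts.items()}  # densities
    p = lam / (lam + 1)                         # continuation probability
    return lam, delta, p

def free_language_entropy(p, q):
    """[H(p, 1-p) + p H(q_1..q_n)] / (1-p), in bits."""
    return (entropy([p, 1 - p]) + p * entropy(q.values())) / (1 - p)

def entropy_from_measurements(lam, delta):
    """Same quantity written only in the measured lam and delta (lam > 0)."""
    return ((lam + 1) * log2(lam + 1) - lam * log2(lam)
            + lam * entropy(delta.values()))

sample = ["", "a", "ab", "", "b", "ba"]         # hypothetical observations
lam, delta, p = estimate_parameters(sample)
print(lam, p, delta)
print(free_language_entropy(p, delta), entropy_from_measurements(lam, delta))
```

Both entropy functions should agree on any sample containing at least one token,
since the second is obtained from the first by substituting $p = \lambda/(\lambda+1)$ and $q_i = \delta_i$.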

4.4  Average  Information  per  Symbol 

Recall that the entropy of a language measures the average information borne by each
string in the language. That is,

$$H(\Sigma) = \sum_{\sigma \in \Sigma} P_\Sigma(\sigma)\,I_\Sigma(\sigma).$$


However, each of the strings of the language is composed of a number of terminal
symbols (tokens). Therefore, it is interesting to compute the average information borne by
each symbol (token) in the strings in the language. We call this the information
density of the language.

Information density is easy to compute: it is simply the average information borne
by the strings of the language divided by the average length of those strings:

$$\eta(\Sigma) = H(\Sigma)/\Lambda(\Sigma),$$

where we have used $\eta(\Sigma)$ for the average information borne per symbol in the strings of $\Sigma$.
The units of information density are bits/token.

If the grammar $G$ predicts the language $\Sigma$, then

$$\eta(\Sigma) = \eta(G) = H(G)/\Lambda(G).$$

We use this result to compute the information density for several languages.

Theorem: Let $F_n$ be the free language with continuation probability $p$ on $n$ symbols
with probabilities $q_i$. Then, the information density of $F_n$ is:

$$\eta(F_n) = H(p,\bar{p})/p + H(q_1, \ldots, q_n) \text{ bits/token}.$$

Proof: Take the formula for the negentropy of $F_n$ and negate it to get the entropy
formula:

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.$$

Divide this by the average length $\Lambda(F_n) = p/\bar{p}$ to get the information density:

$$\begin{aligned}
\eta(F_n) &= H(F_n)/\Lambda(F_n) \\
&= \frac{[H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}}{p/\bar{p}} \\
&= H(p,\bar{p})/p + H(q_1, \ldots, q_n).
\end{aligned}$$

Q.E.D.

Corollary: The information density of the free language on one symbol with
continuation probability $p$ is:

$$\eta(F_1) = H(p,\bar{p})/p \text{ bits/token}.$$

Corollary: The information density of the free language with continuation probability
$p$ on $n$ equally likely symbols is:

$$\eta = H(p,\bar{p})/p + \lg n \text{ bits/token}.$$

Proof: We simply use $q_i = 1/n$:

$$\begin{aligned}
\eta &= H(p,\bar{p})/p + H(q_1, \ldots, q_n) \\
&= H(p,\bar{p})/p + H(1/n, \ldots, 1/n) \\
&= H(p,\bar{p})/p + (1/n)\lg n + \cdots + (1/n)\lg n \\
&= H(p,\bar{p})/p + \lg n \text{ bits/token}.
\end{aligned}$$

Q.E.D.

Corollary: The information density of the free language with continuation probability
one half and $2^k$ equally likely symbols is $k+2$ bits/token.

Proof: Let $n = 2^k$ and $p = \bar{p} = \tfrac12$ in the previous formula:

$$\begin{aligned}
\eta &= 2H(\tfrac12,\tfrac12) + k \\
&= 2(\tfrac12 \lg 2 + \tfrac12 \lg 2) + k \\
&= 2(\tfrac12 + \tfrac12) + k \\
&= k + 2.
\end{aligned}$$

Q.E.D.

The difference between the entropy of a language and its average information
density can be understood by looking at some simple examples. In particular we will
consider the languages $N_k$ of all nonempty strings on $k$ symbols. Thus, $N_k$ is just the free
language $F_k$ without the empty string. Conversely,

$$F_k = N_k \oplus \varepsilon.$$

Theorem: Let $N_k = (q_1\tau_1 \oplus \cdots \oplus q_k\tau_k)^{p+}$. Then,

$$\begin{aligned}
H(N_k) &= [H(p,\bar{p}) + H(q_1, \ldots, q_k)]/\bar{p} \text{ bits} \\
\Lambda(N_k) &= 1/\bar{p} \text{ tokens} \\
\eta(N_k) &= H(p,\bar{p}) + H(q_1, \ldots, q_k) \text{ bits/token}.
\end{aligned}$$

Proof: Simply apply the previous formulas.    Q.E.D.

To understand the implications of this result we consider an especially simple case,
$N_1$, the language of all nonempty strings on one symbol $\tau$:

$$N_1 = \tau^{p+}.$$

Hence $L(N_1) = \{\tau, \tau\tau, \tau\tau\tau, \ldots\}$.

Corollary: If the continuation probability of $N_1$ is one half, then

$$\begin{aligned}
H(N_1) &= 2 \text{ bits} \\
\Lambda(N_1) &= 2 \text{ tokens} \\
\eta(N_1) &= 1 \text{ bit/token}.
\end{aligned}$$

Proof: Substitute $p = \tfrac12$ in the previous formulas and recall that

$$H(\tfrac12,\tfrac12) = \tfrac12 \lg 2 + \tfrac12 \lg 2 = \lg 2 = 1 \text{ bit}.$$

Q.E.D.

This  result  is  easy  to  interpret,  since  each  succeeding  token  indicates  that  the 
choice  has  been  made  to  continue  the  string.  Since  the  probability  of  the  choice  is  one 
half,  each  token  conveys  one  bit  of  information. 

Next we consider $N_2$, which can be considered the language of nonnull strings of
binary digits:

$$N_2 = (q0 \oplus \bar{q}1)^{p+}.$$

The following theorem addresses the information density of this language.

Theorem: If $N_2$ is the language of nonnull binary strings:

$$N_2 = (q0 \oplus \bar{q}1)^{p+},$$

then

$$\begin{aligned}
H(N_2) &= [H(p,\bar{p}) + H(q,\bar{q})]/\bar{p} \text{ bits} \\
\Lambda(N_2) &= 1/\bar{p} \text{ tokens} \\
\eta(N_2) &= H(p,\bar{p}) + H(q,\bar{q}) \text{ bits/token}.
\end{aligned}$$

Proof: Apply the previous formulas.    Q.E.D.

Corollary: Suppose the binary digits 0 and 1 are equally likely. Then the information
density of the language of nonnull binary strings is

$$\eta(N_2) = 1 + H(p,\bar{p}) \text{ bits/token}.$$

Proof: Apply the previous theorem with $q = \bar{q} = \tfrac12$.    Q.E.D.

Note that since $H(p,\bar{p}) > 0$, we know

$$\eta(N_2) > 1 \text{ bit/token}.$$

Since in this case a token is a binary digit, we have the somewhat surprising result that
the information density of the language of binary strings is greater than one bit per
binary digit. How can this be? The extra $H(p,\bar{p})$ bits of information per binary digit
come from the fact that the binary strings are variable length. We previously saw that
in $N_1$ the continuation of the string conveys $H(p,\bar{p})$ bits of information.


The source of the extra information can be made clearer by considering a language
in which it is absent, the language of all $n$ digit binary strings:

$$W_n = (q0 \oplus \bar{q}1)^n.$$

Let $q = \bar{q} = \tfrac12$. Then

$$\begin{aligned}
H(W_n) &= nH(\tfrac12,\tfrac12) = n \text{ bits} \\
\Lambda(W_n) &= n\Lambda[q0 \oplus \bar{q}1] = n \text{ tokens} \\
\eta(W_n) &= n/n = 1 \text{ bit/token}.
\end{aligned}$$

Thus the information density of $W_n$ is one bit per binary digit, as expected. Since all
the strings of $W_n$ are the same length, no information is conveyed by the continuation
of the string.

Consider again the language of nonempty base $k$ strings:

$$N_k = (q_1\tau_1 \oplus \cdots \oplus q_k\tau_k)^{p+}.$$

We have seen that its information density is

$$\eta(N_k) = H(p,\bar{p}) + H(q_1, \ldots, q_k) \text{ bits/token}.$$

Now we can see that this result is intuitive, since each additional symbol in a string
conveys two pieces of information: the decision to continue, $H(p,\bar{p})$ bits, and the
symbol chosen for the continuation, $H(q_1, \ldots, q_k)$ bits.
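
The three families of languages compared in this section can be tabulated with a
short Python sketch. The function names are illustrative assumptions; each function
returns the triple (entropy in bits, average length in tokens, information density in
bits/token) using the formulas stated above.

```python
from math import log2

def H(*probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(x * log2(x) for x in probs if x > 0)

def free_language(p, q):
    """F_n = (q1 t1 (+) ... (+) qn tn)^{p*}."""
    pbar = 1 - p
    h = (H(p, pbar) + p * H(*q)) / pbar
    lam = p / pbar
    return h, lam, h / lam

def nonempty_language(p, q):
    """N_k = (q1 t1 (+) ... (+) qk tk)^{p+}."""
    pbar = 1 - p
    h = (H(p, pbar) + H(*q)) / pbar
    lam = 1 / pbar
    return h, lam, h / lam

def fixed_length_language(n, q):
    """W_n: all strings of exactly n tokens drawn with probabilities q."""
    h = n * H(*q)
    return h, n, h / n

print(free_language(0.5, [0.25] * 4))        # density k + 2 = 4 for k = 2
print(nonempty_language(0.5, [1.0]))         # N_1: 2 bits, 2 tokens, 1 bit/token
print(nonempty_language(0.5, [0.5, 0.5]))    # N_2: density 1 + H(1/2, 1/2) = 2
print(fixed_length_language(8, [0.5, 0.5]))  # W_8: density 1 bit/token
```

Comparing the last two lines shows the extra $H(p,\bar{p})$ bits per token that the
variable-length language carries over the fixed-length one.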

5.    Information Theoretic Properties of Grammars

In this section we consider two information theoretic quantities that are properties
of grammars, as opposed to properties of the languages predicted by those grammars.
These properties are the average length of a derivation from a grammar and the
average amount of information consumed by a production in a grammar.

5.1   Average Derivation Length

First we consider the average length of a derivation from a context-free grammar $G$,
that is, the average number of productions that must be applied in going from the goal
symbol $\nu_0$ to a terminal string $\sigma$. We apply similar techniques to those previously
introduced, transforming the set of productions into an equivalent set of simultaneous
equations.

We will let $D(\alpha)$ represent the average length of a derivation from a string $\alpha$ of
terminals and nonterminals. Now consider an arbitrary BNF production

$$\nu \to p_1\alpha_1 \oplus \cdots \oplus p_n\alpha_n$$

in $P$, the set of productions in $G$. We want to compute $D(\nu)$, the average length of a
derivation from $\nu$. In deriving a terminal string from a string containing $\nu$, we apply
the production $\nu \to \alpha_i$ with probability $p_i$. If we apply $\nu \to \alpha_1$, then the length of the
derivation is one plus the length of the derivation from $\alpha_1$. The same holds for each
$\nu \to \alpha_i$. Thus the average length of a derivation from $\nu$ is

$$\begin{aligned}
D(\nu) &= p_1[1 + D(\alpha_1)] + \cdots + p_n[1 + D(\alpha_n)] \\
&= (p_1 + \cdots + p_n) + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n) \\
&= 1 + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n).
\end{aligned}$$

This result is intuitive: the constant 1 accounts for the fact that we must apply one
production to eliminate the $\nu$; the remainder of the terms are the weighted average of
the average derivation lengths of the alternands.

Next we derive a number of rules for simplifying the right-hand sides of these
equations. If $\alpha$ is the empty string, then no productions can be applied to it, so

$$D(\varepsilon) = 0.$$

If $\alpha$ begins with a terminal symbol $\tau$, $\alpha = \tau\beta$, then, since no productions can be applied
to a terminal, we have

$$D(\tau\beta) = D(\beta).$$

If $\alpha$ begins with a nonterminal symbol $\mu$, $\alpha = \mu\beta$, then, since both $\mu$ and $\beta$ must be
reduced to terminal strings, the average length of a derivation from $\alpha$ must be the sum
of the average lengths of derivations from $\mu$ and $\beta$:

$$D(\mu\beta) = D(\mu) + D(\beta).$$

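We summarize the formulas for computing average derivation length in Table 5.

TABLE 5.    Average Derivation Length for Grammars

$$\begin{aligned}
D[\nu \to p_1\alpha_1 \oplus \cdots \oplus p_n\alpha_n] = D(\nu) &= 1 + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n) \\
D(\varepsilon) &= 0 \\
D(\tau) &= 0 \\
D(\alpha\beta) &= D(\alpha) + D(\beta)
\end{aligned}$$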


Theorem: Let $F_n$ be the following annotated grammar for the free language on $n$
symbols:

$$\begin{aligned}
F_n &\to \bar{p}\varepsilon \oplus pAF_n \\
A &\to q_1\tau_1 \oplus \cdots \oplus q_n\tau_n.
\end{aligned}$$

Then the average derivation length of $F_n$ is

$$D(F_n) = (p+1)/\bar{p} \text{ productions}.$$

Proof: We transform the productions into the equations:

$$\begin{aligned}
D(F_n) &= 1 + \bar{p}D(\varepsilon) + pD(AF_n) \\
D(A) &= 1 + q_1D(\tau_1) + \cdots + q_nD(\tau_n).
\end{aligned}$$

Thus, $D(A) = 1$. Simplifying we derive

$$\begin{aligned}
D(F_n) &= 1 + p[D(A) + D(F_n)] \\
&= 1 + p + pD(F_n).
\end{aligned}$$

Solving for $D(F_n)$ we have

$$D(F_n) = (p+1)/\bar{p} \text{ productions}.$$

Q.E.D.

As would be expected, the average derivation length is independent of the
probabilities $q_i$.

Corollary: The average derivation length of $F_n$ with continuation probability one half
is 3 productions.

Proof: Apply the theorem with $p = \bar{p} = \tfrac12$.    Q.E.D.

This result is also intuitive. We always must apply at least one production (for $F_n$).
If we choose to stop, with probability one half, we have applied one production.
However, if we choose to continue, with probability one half, then we must apply two more
productions (one for $A$, one for $F_n$), and repeat our choice. Thus we have:

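$$D(F_n) = \tfrac12(1) + \tfrac14(1 + 2) + \tfrac18(1 + 2 \cdot 2) + \cdots$$

Regrouping gives

$$D(F_n) = 2\left(\tfrac14 \cdot 1 + \tfrac18 \cdot 2 + \cdots\right) + \left(\tfrac12 + \tfrac14 + \tfrac18 + \cdots\right) = 2 + 1 \text{ productions}.$$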
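
The intuitive argument can be checked numerically. In the grammar above, a string of
$k$ tokens has probability $\bar{p}p^k$ and needs $2k+1$ productions ($k$ continuations of $F_n$, $k$
applications of $A$, and one final stop). The Python sketch below, with illustrative
function names, sums a truncated form of this series and compares it with the closed
form $(p+1)/\bar{p}$; the truncation length is an arbitrary assumption.

```python
def derivation_length_series(p, terms=5000):
    """Truncated sum over k >= 0 of (1-p) * p**k * (2k + 1) productions."""
    return sum((1 - p) * p**k * (2 * k + 1) for k in range(terms))

def derivation_length_closed(p):
    """Closed form (p + 1) / (1 - p) from the theorem."""
    return (p + 1) / (1 - p)

for p in (0.5, 0.9):
    print(p, derivation_length_series(p), derivation_length_closed(p))
```

For $p = \tfrac12$ both values are 3 productions up to rounding, in agreement with the corollary.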

5.2  Average Information Used by a Production

Consider a leftmost derivation of a terminal string from a grammar. At each point
in the derivation a nonterminal must be replaced by a string according to the
productions of the grammar. In many of these cases there will be a choice of which of several
alternate productions is to be applied. Such a choice will require information to be
supplied. Considering the average information that must be supplied per applied
production gives us a gauge of the efficiency with which a grammar transforms
information into terminal strings.

Since in an unambiguous context-free grammar there is a unique leftmost
derivation of any string in the language generated by the grammar, we can set up a one-one
correspondence between the leftmost derivations and the strings. Thus, for any
$\sigma \in L(G)$ there is a unique series of productions $\pi_1, \pi_2, \ldots, \pi_k$ that generates $\sigma$. As we
saw before, if production $\pi_i$ has an a priori probability $P(\pi_i)$ of being chosen, then the
probability that $G$ will generate $\sigma$ is

$$P_G(\sigma) = P(\pi_1)P(\pi_2) \cdots P(\pi_k).$$

Hence, the generation of a string by a grammar can be viewed as a series of choices $\pi_1,
\ldots, \pi_k$, having probabilities $P(\pi_1), \ldots, P(\pi_k)$ of being made.

Now we look at this formula a different way. Recall [Shannon, Hamming] that when a
previously undetermined situation with a priori probability $p$ is determined, the
information conveyed is $-\lg p$ bits. That is, information is conveyed by making choices.
Thus, the information conveyed by making choice $\pi_i$, with probability $P(\pi_i)$, is

$$I(\pi_i) = -\lg P(\pi_i) \text{ bits}.$$

Therefore, the information conveyed by $\sigma$ is just the total information conveyed by the
choices that lead to $\sigma$:

$$I_G(\sigma) = -\lg P_G(\sigma) = -\lg P(\pi_1) + \cdots + -\lg P(\pi_k) = I(\pi_1) + \cdots + I(\pi_k).$$

This information is used by the grammar in going from an undetermined nonterminal
symbol to a completely determined terminal string. That is, this information drives a
decrease in the entropy from $H(G)$ to 0 (since a terminal string has no entropy).

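Now recall that

$$\begin{aligned}
H(G) &= -\sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{\sigma \in L(G)} P_G(\sigma)\,I_G(\sigma).
\end{aligned}$$

Thus, the entropy of a language is the average information conveyed by its strings.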

A grammar with higher entropy is less constrained, more disordered, than one with
lower entropy, so on the average it takes more information to generate a particular
string from it. A grammar with low entropy is highly constrained, so on the average
little information is needed (or used) in generating a string; there are fewer choices to be
made.

We can now apply these results to determining the average information used per
production by a grammar. Since the information conveyed by a string is the same as
the information used in its derivation, the entropy of a language measures the average
information used by a grammar in generating a string of that language. If we also know
the average derivation length for that grammar, that is, the average number of
productions needed to generate a string, then we know the average amount $Q$ of information
consumed per production. Summarizing,

$$Q(G) = H(G)/D(G),$$

where $H(G) = H[L(G)]$.

Theorem: Consider the following grammar for the free language on $n$ symbols:

$$\begin{aligned}
F_n &\to \bar{p}\varepsilon \oplus pAF_n \\
A &\to q_1\tau_1 \oplus \cdots \oplus q_n\tau_n.
\end{aligned}$$

This grammar uses on the average

$$Q(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/(p+1) \text{ bits/production}$$

to generate a string.

Proof: Recall that

$$\begin{aligned}
H(F_n) &= [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p} \text{ bits} \\
D(F_n) &= (p+1)/\bar{p} \text{ productions},
\end{aligned}$$

and divide.

Q.E.D.

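Corollary: $F_2$ with continuation probability one half uses on the average one bit of
information per production.

Proof: Set $p = \bar{p} = q_1 = q_2 = \tfrac12$. Then,

$$\begin{aligned}
Q(F_2) &= [H(\tfrac12,\tfrac12) + \tfrac12 H(\tfrac12,\tfrac12)]/(\tfrac12 + 1) \\
&= H(\tfrac12,\tfrac12) \\
&= 1 \text{ bit}.
\end{aligned}$$

Q.E.D.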

This  is  intuitive,  since  this  grammar  must  use  one  bit  on  each  production,  either  to 
decide  whether  or  not  to  continue,  or  to  decide  which  symbol  to  generate. 

Consider $F_1$, the above grammar restricted to generate the free language on one
symbol, with continuation probability one half. The above formula says the information
consumed per production is

$$\begin{aligned}
Q(F_1) &= [H(\tfrac12,\tfrac12) + \tfrac12 H(1)]/(\tfrac12 + 1) \\
&= 2/3 \text{ bits/production}.
\end{aligned}$$

This might be surprising, since it seems that with each successive symbol of the string
exactly one bit of information is being used, namely, to decide whether or not to
continue. The source of the inefficiency can be found by looking at $F_1$:

$$\begin{aligned}
F_1 &\to \bar{p}\varepsilon \oplus pAF_1 \\
A &\to \tau
\end{aligned}$$

There is a redundant production $A \to \tau$ that uses no information; this decreases the
average information used per production. If we eliminate this redundancy, we get the
one-production grammar

$$F_1 \to \bar{p}\varepsilon \oplus p\tau F_1.$$

It has the same entropy as the previous grammar, but a shorter average derivation
length:

$$\begin{aligned}
D(F_1) &= 1 + pD(F_1) \\
&= 1/\bar{p}.
\end{aligned}$$

For $p = \tfrac12$ its average derivation length is 2 productions, as opposed to 3 for the version
with the redundant rule. The information used per production is then

$$\begin{aligned}
Q(F_1) &= H(F_1)/D(F_1) \\
&= \frac{H(p,\bar{p})/\bar{p}}{1/\bar{p}} \\
&= H(p,\bar{p}) \text{ bits/production}.
\end{aligned}$$

In the case $p = \tfrac12$ this grammar uses one bit per production, as would be expected.
Thus we have a way of comparing the efficiencies with which grammars use information
and of determining whether grammars have useless productions.
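
The comparison of the two grammars for $F_1$ can be reproduced mechanically. The
Python sketch below solves the simultaneous equations for $H$ and $D$ by simple
fixed-point iteration, which is a numerical stand-in for the analytic solution used in the
text and converges only when the averages are finite. The dictionary encoding of a
grammar, the function names and the iteration count are all illustrative assumptions;
any right-hand-side symbol that is not a key of the dictionary is treated as a terminal.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(x * log2(x) for x in probs if x > 0)

def solve(grammar, cost, iterations=500):
    """Iterate X(v) = cost(alts) + sum_i p_i * (sum of X over nonterminals in alpha_i)."""
    x = {v: 0.0 for v in grammar}
    for _ in range(iterations):
        x = {
            v: cost(alts) + sum(p * sum(x[s] for s in rhs if s in grammar)
                                for p, rhs in alts)
            for v, alts in grammar.items()
        }
    return x

def choice_entropy(alts):
    """Information needed to choose among the alternatives of one production."""
    return entropy([p for p, _ in alts])

def one_production(alts):
    """Each applied production adds one step to the derivation."""
    return 1.0

def bits_per_production(grammar, goal):
    h = solve(grammar, choice_entropy)[goal]   # H(G)
    d = solve(grammar, one_production)[goal]   # D(G)
    return h / d

p = 0.5
# F1 with the redundant rule A -> t, and the compact one-production version.
redundant = {"F": [(1 - p, []), (p, ["A", "F"])], "A": [(1.0, ["t"])]}
compact = {"F": [(1 - p, []), (p, ["t", "F"])]}

print(bits_per_production(redundant, "F"))  # approximately 2/3
print(bits_per_production(compact, "F"))    # approximately 1
```

The same two cost functions applied to one grammar give both quantities, which
reflects the parallel structure of the equations for $H$ and $D$.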

6.   Example Applications

In this example we illustrate the previously described techniques by their application
to a nontrivial language. Table 6 shows the context-free grammar for lambda-calculus
expressions; we have added unknown production probabilities.

TABLE 6.   Annotated Grammar for Lambda Calculus

$$\begin{aligned}
E &\to q_1 I \oplus q_2\,\text{'('}\;\text{'λ'}\;I\;E\;\text{')'} \oplus q_3\,\text{'('}\;E\;E\;\text{')'} \\
I &= (p_1 a \oplus p_2 b \oplus \cdots \oplus p_{36} 9)^{p+}
\end{aligned}$$

where $q_1 + q_2 + q_3 = 1$ and $p_1 + p_2 + \cdots + p_{36} = 1$.

We now apply $H$ to these productions and solve for the negentropy. Thus,

$$H(E) = H(q_1,q_2,q_3) + q_1H(I) + q_2H[\text{'('}\;\text{'λ'}\;I\;E\;\text{')'}] + q_3H[\text{'('}\;E\;E\;\text{')'}].$$

Tokens can be ignored in computing negentropies, so this reduces to

$$\begin{aligned}
H(E) &= H(q_1,q_2,q_3) + q_1H(I) + q_2[H(I) + H(E)] + q_3 \cdot 2H(E) \\
&= H(q_1,q_2,q_3) + (q_1+q_2)H(I) + (q_2+2q_3)H(E).
\end{aligned}$$

Solving for $H(E)$ we have

$$H(E) = \frac{H(q_1,q_2,q_3) + (q_1+q_2)H(I)}{1 - q_2 - 2q_3}.$$

It remains to solve for $H(I)$. We apply the formula for $H(A^{p+})$ to get

$$\begin{aligned}
H(I) &= H[(p_1 a \oplus \cdots \oplus p_{36} 9)^{p+}] \\
&= [H(p,\bar{p}) + H(p_1, p_2, \ldots, p_{36})]/\bar{p}.
\end{aligned}$$

The resulting formula for the negentropy of the lambda-calculus is:

$$H = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1, \ldots, p_{36})]/\bar{p}}{1 - q_2 - 2q_3} \text{ bits}.$$

To actually compute the negentropy it is, of course, necessary to determine the
production probabilities $p$, $q_1$, $q_2$, $q_3$, $p_1$, $p_2$, ..., $p_{36}$. Since all the probabilities associated
in an alternation must add to unity, there are just 38 independent probabilities to be
determined (one for $p$, two among the $q_i$, and 35 among the $p_i$).

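To determine these probabilities we will calculate the measurable properties of the
strings in the language: the average string length and the occurrence densities of the
tokens. The average length is:

$$\begin{aligned}
\lambda &= \Lambda(E) \\
&= q_1\Lambda(I) + q_2\Lambda[\text{'('}\;\text{'λ'}\;I\;E\;\text{')'}] + q_3\Lambda[\text{'('}\;E\;E\;\text{')'}] \\
&= q_1\Lambda(I) + q_2[\Lambda(\text{'('}) + \Lambda(\text{'λ'}) + \Lambda(I) + \lambda + \Lambda(\text{')'})] + q_3[\Lambda(\text{'('}) + 2\lambda + \Lambda(\text{')'})] \\
&= q_1\Lambda(I) + q_2[\Lambda(I) + \lambda + 3] + q_3[2\lambda + 2] \\
&= (q_1+q_2)\Lambda(I) + (q_2+2q_3)\lambda + 3q_2 + 2q_3.
\end{aligned}$$

Solving for $\lambda$ we have

$$\lambda = \frac{(q_1+q_2)\Lambda(I) + 3q_2 + 2q_3}{1 - q_2 - 2q_3}.$$

It remains to compute $\Lambda(I)$:

$$\begin{aligned}
\Lambda(I) &= \Lambda[(p_1 a \oplus \cdots \oplus p_{36} 9)^{p+}] \\
&= \Lambda[p_1 a \oplus \cdots \oplus p_{36} 9]/\bar{p} \\
&= (p_1 + \cdots + p_{36})/\bar{p} \\
&= 1/\bar{p} \text{ tokens}.
\end{aligned}$$

Substituting this result into the formula for $\lambda$ gives the average length of a string in the
lambda-calculus:

$$\lambda = \frac{(q_1+q_2)/\bar{p} + 3q_2 + 2q_3}{1 - q_2 - 2q_3} \text{ tokens}.$$

Recalling that the information density of a language is the ratio of its entropy and
average length, we have

$$\eta = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1, \ldots, p_{36})]/\bar{p}}{(q_1+q_2)/\bar{p} + 3q_2 + 2q_3} \text{ bits/token}.$$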

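It remains to compute the frequencies of occurrence of the tokens in the language.
First we compute $\varphi_{\mathrm{lp}}$, the average number of left parentheses in a string.

$$\begin{aligned}
\varphi_{\mathrm{lp}} &= \Phi_{\mathrm{lp}}(E) \\
&= q_1\Phi_{\mathrm{lp}}(I) + q_2\Phi_{\mathrm{lp}}[\text{'('}\;\text{'λ'}\;I\;E\;\text{')'}] + q_3\Phi_{\mathrm{lp}}[\text{'('}\;E\;E\;\text{')'}] \\
&= q_1 \cdot 0 + q_2(1 + \varphi_{\mathrm{lp}}) + q_3(1 + \varphi_{\mathrm{lp}} + \varphi_{\mathrm{lp}}) \\
&= q_2 + q_3 + (q_2 + 2q_3)\varphi_{\mathrm{lp}}.
\end{aligned}$$

Solving for $\varphi_{\mathrm{lp}}$ we have

$$\varphi_{\mathrm{lp}} = \frac{q_2 + q_3}{1 - q_2 - 2q_3}.$$

Clearly, $\varphi_{\mathrm{rp}} = \varphi_{\mathrm{lp}}$. It also follows that

$$\delta_{\mathrm{lp}} = \delta_{\mathrm{rp}} = \varphi_{\mathrm{lp}}/\lambda = \varphi_{\mathrm{rp}}/\lambda.$$

Next we compute $\varphi_\lambda$, the frequency of the token 'λ', by the same process:

$$\begin{aligned}
\varphi_\lambda &= \Phi_\lambda(E) \\
&= q_1\Phi_\lambda(I) + q_2\Phi_\lambda[\text{'('}\;\text{'λ'}\;I\;E\;\text{')'}] + q_3\Phi_\lambda[\text{'('}\;E\;E\;\text{')'}] \\
&= q_2[\Phi_\lambda(\text{'λ'}) + \Phi_\lambda(E)] + 2q_3\Phi_\lambda(E) \\
&= q_2(1 + \varphi_\lambda) + 2q_3\varphi_\lambda.
\end{aligned}$$

Solving for $\varphi_\lambda$ we have

$$\varphi_\lambda = \frac{q_2}{1 - q_2 - 2q_3},$$

and $\delta_\lambda = \varphi_\lambda/\lambda$.

Finally, we solve for $\varphi_i$, the frequency of the $i$-th alphanumeric character:

$$\begin{aligned}
\varphi_i &= \Phi_i(E) \\
&= q_1\Phi_i(I) + q_2[\Phi_i(I) + \Phi_i(E)] + 2q_3\Phi_i(E) \\
&= (q_1+q_2)\Phi_i(I) + (q_2+2q_3)\varphi_i.
\end{aligned}$$

Solving for $\varphi_i$ we have

$$\varphi_i = \frac{(q_1+q_2)\Phi_i(I)}{1 - q_2 - 2q_3}.$$

It remains to determine $\Phi_i(I)$:

$$\begin{aligned}
\Phi_i(I) &= \Phi_i[(p_1 a \oplus \cdots \oplus p_{36} 9)^{p+}] \\
&= \Phi_i[p_1 a \oplus \cdots \oplus p_{36} 9]/\bar{p} \\
&= [p_1\Phi_i(a) + \cdots + p_{36}\Phi_i(9)]/\bar{p} \\
&= p_i/\bar{p}.
\end{aligned}$$

Substituting this into the equation for $\varphi_i$ yields

$$\varphi_i = \frac{(q_1+q_2)\,p_i}{(1 - q_2 - 2q_3)\,\bar{p}}.$$

As usual, $\delta_i = \varphi_i/\lambda$.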

Unfortunately, the equations derived for the lambda-calculus are quite difficult to
solve. In practice they would probably have to be solved numerically.

We can gain some insight into these equations by considering their behavior in some
typical situations. Therefore, suppose that all the identifier characters are equally
likely:

$$p_1 = p_2 = \cdots = p_{36} = 1/36.$$

Then we have

$$\varphi_i = \frac{q_1 + q_2}{36\,\bar{p}\,(1 - q_2 - 2q_3)}.$$

Now let $\mu = 1 - q_2 - 2q_3$. Then we have

$$\begin{aligned}
\lambda &= [(q_1+q_2)/\bar{p} + 3q_2 + 2q_3]/\mu, \\
\varphi_\lambda &= q_2/\mu, \\
\varphi_{\mathrm{lp}} = \varphi_{\mathrm{rp}} &= (q_2 + q_3)/\mu, \\
\varphi_i &= (q_1+q_2)/(36\,\bar{p}\mu).
\end{aligned}$$

Rewriting the first equation:

$$\begin{aligned}
\lambda &= (q_1+q_2)/(\bar{p}\mu) + (3q_2 + 2q_3)/\mu \\
&= 36\varphi_i + (3q_2 + 2q_3)/\mu \\
&= 36\varphi_i + q_2/\mu + (2q_2 + 2q_3)/\mu \\
&= 36\varphi_i + \varphi_\lambda + 2\varphi_{\mathrm{lp}}.
\end{aligned}$$

If we rewrite this

$$\lambda = 36\varphi_i + \varphi_\lambda + \varphi_{\mathrm{lp}} + \varphi_{\mathrm{rp}},$$

then it becomes intuitive: the average length of a string is the sum of the average
frequencies of occurrence of each terminal symbol.
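
The identity relating average length to the token frequencies gives a convenient
consistency check on the formulas of this section. The Python sketch below evaluates
the predicted measurables for equally likely identifier characters and verifies the
identity exactly with rational arithmetic. The function name and the particular
probabilities are assumptions chosen only for illustration; the guard on $\mu$ reflects
the observation that the formulas give positive, finite averages only when
$q_2 + 2q_3 < 1$.

```python
from fractions import Fraction

def lambda_calculus_properties(q1, q2, q3, p, n_chars=36):
    """Predicted measurables for equally likely identifier characters."""
    mu = 1 - q2 - 2 * q3
    if mu <= 0:
        raise ValueError("need q2 + 2*q3 < 1 for finite averages")
    pbar = 1 - p
    lam = ((q1 + q2) / pbar + 3 * q2 + 2 * q3) / mu   # average string length
    phi_lambda = q2 / mu                               # lambda tokens per string
    phi_paren = (q2 + q3) / mu                         # each kind of parenthesis
    phi_char = (q1 + q2) / (n_chars * pbar * mu)       # each identifier character
    return lam, phi_lambda, phi_paren, phi_char

# Assumed production probabilities, for illustration only.
lam, phi_lambda, phi_paren, phi_char = lambda_calculus_properties(
    Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(1, 2))

print(lam, phi_lambda, phi_paren, phi_char)               # 11 1 2 1/6
print(lam == 36 * phi_char + phi_lambda + 2 * phi_paren)  # True
```

Going in the other direction, from measured densities to the production
probabilities, would require inverting these formulas, for example with a numerical
root finder.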

Finally, we derive the information consumed per production by this grammar. To do
this we expand the Kleene cross:

$$\begin{aligned}
I &\to \bar{p}A \oplus pAI \\
A &\to p_1 a \oplus \cdots \oplus p_{36} 9
\end{aligned}$$

and compute its average derivation length:

$$\begin{aligned}
D(I) &= 1 + \bar{p}D(A) + pD(A) + pD(I) \\
D(A) &= 1
\end{aligned}$$

Therefore, $D(I) = 2 + pD(I)$, so $D(I) = 2/\bar{p}$. Next we compute $D(E)$:

$$\begin{aligned}
D(E) &= 1 + q_1D(I) + q_2[D(I) + D(E)] + q_3[2D(E)] \\
&= 1 + 2(q_1 + q_2)/\bar{p} + (q_2 + 2q_3)D(E).
\end{aligned}$$

Therefore,

$$D(E) = \frac{1 + 2(q_1 + q_2)/\bar{p}}{1 - q_2 - 2q_3} \text{ productions}.$$

The average information used per production is the ratio $H(E)/D(E)$, which is

$$Q = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1, \ldots, p_{36})]/\bar{p}}{1 + 2(q_1+q_2)/\bar{p}} \text{ bits/production}.$$

7.   Conclusions 

We  have  described  means  for  computing  a  number  of  information-theoretic  properties 
of  languages  and  their  grammars.  These  properties  include,  for  languages,  their 
entropy,  average  string  length,  information  density  and  density  of  occurrence  for  a 
given  token.  For  grammars  we  have  shown  how  to  compute  average  derivation  length 
and  the  information  used  by  the  grammar  per  production. 

All of these techniques are based on the application of simple recursive formulas to
annotated grammars, that is, grammars annotated with production probabilities. We
expect that the same techniques can be applied to the computation of many other
properties of both grammars and other symbol systems.